arXiv:2512.22673v3 Announce Type: replace
Abstract: Travel planning is a natural real-world task for testing large language models' (LLMs) planning and tool-use abilities. Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for truly real-world travel planning. We collect user queries, user preferences, and tools from real scenarios, and construct three subtasks — Single-Turn, Multi-Turn, and Unsolvable — to evaluate agents' three core capabilities in real settings: (1) solving problems independently, (2) interacting with users to elicit implicit preferences, and (3) recognizing their own capability boundaries. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools, allowing agents to combine them to solve most practical travel planning problems. We evaluate multiple LLMs on TravelBench and find that even advanced models exhibit imbalanced performance across the different capabilities. Further systematic verification demonstrates the stability of the proposed benchmark. TravelBench provides a practical and reproducible benchmark to advance research on LLM agents for real-world travel planning.
Behavior change beyond intervention: an activity-theoretical perspective on human-centered design of personal health technology
Introduction
Modern personal technologies, such as smartphone apps with artificial intelligence (AI) capabilities, have significant potential for helping people make necessary changes in their behavior


