ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

May 21, 2026

arXiv:2605.10787v2 Announce Type: replace
Abstract: Current LLM agents are proficient at calling isolated APIs but struggle with the “last mile” of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation.
We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far below the 90% achieved by humans. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursue recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
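
The abstract does not describe how the seed-driven architecture is built, so the following is only a minimal sketch of the general idea, not the benchmark's actual code. It assumes a single seeded random generator drives both the initial sandbox state and the injected API failures, which makes each episode reproducible for a given seed. `SeededSandbox`, `ToolError`, `failure_rate`, `run_episode`, and the account names are all hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass, field


class ToolError(Exception):
    """Raised when the sandbox injects a simulated API failure."""


@dataclass
class SeededSandbox:
    """Hypothetical stateful sandbox whose state and failures derive from one seed."""
    seed: int
    failure_rate: float = 0.2  # assumed probability that any tool call fails
    balances: dict = field(default_factory=dict)

    def __post_init__(self):
        # One private generator per sandbox: same seed, same episode.
        self.rng = random.Random(self.seed)
        # The initial environment state is drawn from the seed.
        self.balances = {
            name: self.rng.randint(0, 1000) for name in ("alice", "bob")
        }

    def _maybe_fail(self, tool_name):
        # Failure injection consumes the same generator, so it is also reproducible.
        if self.rng.random() < self.failure_rate:
            raise ToolError(f"{tool_name}: simulated transient failure")

    def get_balance(self, account):
        self._maybe_fail("get_balance")
        return self.balances[account]

    def transfer(self, src, dst, amount):
        # Interdependent tool: its outcome depends on state left by earlier calls.
        self._maybe_fail("transfer")
        if self.balances[src] < amount:
            raise ValueError("insufficient funds")
        self.balances[src] -= amount
        self.balances[dst] += amount


def run_episode(seed, steps=5):
    """Call one tool repeatedly and record successes and injected failures."""
    sandbox = SeededSandbox(seed)
    log = []
    for _ in range(steps):
        try:
            log.append(("ok", sandbox.get_balance("alice")))
        except ToolError as exc:
            log.append(("error", str(exc)))
    return log


if __name__ == "__main__":
    # Re-running with the same seed replays the same states and failures.
    print(run_episode(7) == run_episode(7))
```

Under these assumptions, varying the seed yields different starting states and failure patterns, while fixing it lets two agents be compared on exactly the same episode.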
