ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

arXiv:2605.10787v2
Abstract: Current LLM agents are proficient at calling isolated APIs but struggle with the “last mile” of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation.
We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far below human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursue recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
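
The abstract does not describe how the seed-driven architecture is implemented. As a rough illustration of the general idea only, the sketch below shows one way a single seed could fix both the initial environment state and the pattern of injected API failures, so that runs differ across seeds but replay identically for the same seed. Every name here (`SeededSandbox`, `FlakyToolError`, `trace`, the `get_balance` tool, the `failure_rate` parameter) is hypothetical and not taken from ComplexMCP.

```python
import random


class FlakyToolError(Exception):
    """Raised when the sandbox injects a simulated API failure."""


class SeededSandbox:
    """Hypothetical stateful sandbox whose behavior depends only on its seed."""

    def __init__(self, seed, failure_rate=0.2):
        # A private generator keeps the sandbox independent of global random state.
        self._rng = random.Random(seed)
        self.failure_rate = failure_rate
        # The initial environment state is drawn from the seed (assumed example state).
        self.balance = self._rng.randint(100, 1000)

    def call(self, tool_name):
        # Failures are drawn from the same seeded generator, so they are
        # unpredictable to the agent but reproducible for the evaluator.
        if self._rng.random() < self.failure_rate:
            raise FlakyToolError(f"{tool_name}: simulated transient failure")
        return {"tool": tool_name, "balance": self.balance}


def trace(seed, n_calls=10):
    """Run a fixed sequence of calls and record the initial state and outcomes."""
    sandbox = SeededSandbox(seed)
    outcomes = []
    for _ in range(n_calls):
        try:
            sandbox.call("get_balance")
            outcomes.append("ok")
        except FlakyToolError:
            outcomes.append("fail")
    return sandbox.balance, outcomes
```

Under these assumptions, calling `trace(7)` twice yields the same initial balance and the same sequence of successes and failures, which is the sense in which a seeded evaluation can be deterministic while still varying from one seed to the next.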

