arXiv:2601.20221v1 Announce Type: new Abstract: Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach for reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and […]
GNN Explanations that do not Explain and How to find Them
arXiv:2601.20815v1 Announce Type: cross Abstract: Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model’s inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we […]
Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control
arXiv:2601.20090v1 Announce Type: new Abstract: Large language model (LLM)-powered agents can translate high-level user intents into plans and actions in an environment. Yet after observing an outcome, users may wonder: What if I had phrased my intent differently? We introduce a framework that enables such counterfactual reasoning in agentic LLM-driven control scenarios, while providing formal […]
Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
arXiv:2601.20861v1 Announce Type: cross Abstract: One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continual learning systems presents several challenges, one of which is the large memory requirement of the gradient-based algorithms used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged […]
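As a rough illustration of why ES sidesteps the memory cost of gradient-based training: a standard ES update needs only forward fitness evaluations of random perturbations, never backpropagated gradients. The sketch below is a generic natural-evolution-strategies-style update on a toy objective; the objective, hyperparameters, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def es_step(theta, fitness, rng, pop=50, sigma=0.1, lr=0.02):
    """One evolutionary-strategies update: estimate a search direction from
    fitness evaluations of Gaussian perturbations alone -- no gradients,
    so no optimizer state or activation memory is kept."""
    eps = rng.standard_normal((pop, theta.size))               # perturbation directions
    scores = np.array([fitness(theta + sigma * e) for e in eps])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize fitness
    return theta + lr * (eps.T @ scores) / (pop * sigma)       # NES-style estimate

# Toy objective standing in for a reward signal: maximize -||theta - 3||^2.
fitness = lambda th: -np.sum((th - 3.0) ** 2)
rng = np.random.default_rng(0)
theta = np.zeros(5)
for _ in range(400):
    theta = es_step(theta, fitness, rng)
```

After a few hundred steps `theta` approaches the optimum near 3; the per-step memory footprint is just the parameter vector and the current population of perturbations.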
Insight Agents: An LLM-Based Multi-Agent System for Data Insights
arXiv:2601.20048v1 Announce Type: new Abstract: Today, E-commerce sellers face several key challenges, including difficulty in discovering and effectively utilizing available programs and tools, and in understanding and utilizing rich data from various tools. We therefore aim to develop Insight Agents (IA), a conversational multi-agent Data Insight system, to provide E-commerce sellers with personalized data […]
Mind the Gap: The Divergence Between Human and LLM-Generated Tasks
arXiv:2508.00282v3 Announce Type: replace Abstract: Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task-generation experiment comparing human responses with those […]
Fuzzy Categorical Planning: Autonomous Goal Satisfaction with Graded Semantic Constraints
arXiv:2601.20021v1 Announce Type: new Abstract: Natural-language planning often involves vague predicates (e.g., suitable substitute, stable enough) whose satisfaction is inherently graded. Existing category-theoretic planners provide compositional structure and pullback-based hard-constraint verification, but treat applicability as crisp, forcing thresholding that collapses meaningful distinctions and cannot track quality degradation across multi-step plans. We propose Fuzzy Category-theoretic Planning […]
Neural Value Iteration
arXiv:2511.08825v2 Announce Type: replace Abstract: The value function of a POMDP exhibits the piecewise-linear-convex (PWLC) property and can be represented as a finite set of hyperplanes, known as $\alpha$-vectors. Most state-of-the-art POMDP solvers (offline planners) follow the point-based value iteration scheme, which performs Bellman backups on $\alpha$-vectors at reachable belief points until convergence. However, since […]
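For readers unfamiliar with the PWLC representation mentioned above: the value function is the pointwise maximum of linear functions, $V(b) = \max_\alpha \langle \alpha, b \rangle$, one $\alpha$-vector per hyperplane. The vectors and belief below are invented for illustration, not taken from the paper.

```python
import numpy as np

# A two-state POMDP value function as a finite set of alpha-vectors.
alphas = np.array([[1.0, 0.0],    # payoff profile of acting as if in state s0
                   [0.0, 1.0],    # ... as if in state s1
                   [0.6, 0.6]])   # a hedging action, good in either state
belief = np.array([0.3, 0.7])     # probability distribution over the two states

values = alphas @ belief          # value of each hyperplane at this belief
print(values.max())               # PWLC value V(b): max over hyperplanes -> 0.7
print(values.argmax())            # index of the dominating alpha-vector -> 1
```

Point-based value iteration improves this set by performing Bellman backups only at a sampled set of reachable beliefs, rather than over the whole belief simplex.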
The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models
arXiv:2512.23340v2 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM are inherently bounded. One solution originates from intricate interactions among multiple LLMs, rendering their collective […]
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
arXiv:2601.10402v3 Announce Type: replace Abstract: The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution […]
ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue
arXiv:2601.20323v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models have rapidly expanded to electrocardiograms, focusing on classification, report generation, and single-turn QA tasks. However, these models fall short in real-world scenarios, lacking multi-turn conversational ability, on-device efficiency, and precise understanding of ECG measurements such as the PQRST intervals. To address these limitations, […]
GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding
arXiv:2402.15769v3 Announce Type: replace-cross Abstract: Pre-trained code models lead the era of code intelligence, with multiple models designed with impressive performance. However, one important problem, data augmentation for code data, which automatically helps developers prepare training data, remains understudied in this field. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance […]