arXiv:2603.19146v2 Announce Type: replace Abstract: Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard autoregressive search procedures, such as beam search, do not directly apply to iterative denoising, where hypotheses are complete intermediate sequences rather than left-to-right prefixes. Furthermore, existing diffusion decoding procedures only provide […]
PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance
arXiv:2606.06823v1 Announce Type: cross Abstract: While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to the low Signal-to-Noise Ratio (SNR) and non-stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed-loop neuro-symbolic LLM agent with market regime modeling […]
The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
arXiv:2606.07017v1 Announce Type: new Abstract: Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and […]
Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks
arXiv:2606.06833v1 Announce Type: cross Abstract: Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal constraint serves as an information bottleneck on attackers, significantly limiting attack performance. Our new Semantic Gambit attack breaks this causal limitation by […]
Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning
arXiv:2605.12655v2 Announce Type: replace Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when […]
Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation
arXiv:2606.06836v1 Announce Type: cross Abstract: Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, […]
Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation
arXiv:2603.24576v2 Announce Type: replace-cross Abstract: Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough […]
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines
arXiv:2605.25645v3 Announce Type: replace-cross Abstract: We present the first end-to-end demonstration of fine-tuning and serving Google’s Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the […]
AI-Driven Test Case Generation from Natural Language Requirements: A Survey of Techniques and Research Gaps
arXiv:2606.06563v1 Announce Type: cross Abstract: Software testing is critical for verifying that systems meet specified requirements, yet remains among the most time-consuming and expensive activities in development. Requirements-based test generation allows test cases to be derived early from requirements artifacts, but generating them directly from natural language is challenging due to inherent ambiguity and imprecision. […]
REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference
arXiv:2606.07141v1 Announce Type: cross Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning patient-specific data is intractable, and retraining with minor […]
CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval
arXiv:2506.11066v3 Announce Type: replace-cross Abstract: Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across […]
Measuring Agents in Production
arXiv:2512.04123v4 Announce Type: replace-cross Abstract: LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 […]