arXiv:2604.04552v2 Announce Type: replace-cross Abstract: Ensemble methods are widely used to improve predictive performance, but their effectiveness often comes at the cost of increased memory usage and computational complexity. In this paper, we identify a conflict in aggregation strategies that negatively impacts prediction stability. We propose a training-free test-time adaptation method (StableTTA) that employs novel image […]
Probabilistic Prediction of Neural Dynamics via Autoregressive Flow Matching
arXiv:2604.11178v1 Announce Type: new Abstract: Forecasting neural activity in response to naturalistic stimuli remains a key challenge for understanding brain dynamics and enabling downstream neurotechnological applications. Here, we introduce a generative forecasting framework for modeling neural dynamics based on autoregressive flow matching (AFM). Building on recent advances in transport-based generative modeling, our approach probabilistically predicts […]
Perceived Importance of Cognitive Skills Among Computing Students in the Era of AI
arXiv:2604.10730v1 Announce Type: cross Abstract: The availability and increasing integration of generative AI tools have transformed computing education. While AI in education presents opportunities, it also raises new concerns about how these powerful know-it-all AI tools, which are becoming widespread, impact cognitive skill development among students. Cognitive skills are essential for academic success and professional […]
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
arXiv:2604.11328v1 Announce Type: new Abstract: Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking […]
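Submodular guarantees for subset selection, as referenced in this abstract, are typically obtained with the classic greedy algorithm, which achieves a (1 - 1/e) approximation for monotone submodular objectives. A minimal sketch under a facility-location objective (the function and all names here are illustrative assumptions, not the paper's actual method):

```python
import numpy as np

def greedy_subset(sim, k):
    """Greedy maximization of the facility-location function
    F(S) = sum_i max_{j in S} sim[i, j], a monotone submodular
    coverage objective over candidate evaluation examples."""
    n = sim.shape[0]
    selected = []
    best = np.zeros(n)  # current coverage: max_{j in S} sim[i, j]
    for _ in range(k):
        # marginal gain of adding each candidate j
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf  # never reselect
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected

# toy similarity matrix over 6 evaluation examples
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
sim = np.exp(-np.square(X[:, None] - X[None, :]).sum(-1))
print(greedy_subset(sim, 3))
```

The greedy choice at each step picks the example whose addition most improves coverage of the full set, which is what makes a small evaluation subset act as a proxy for scoring on all training examples.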
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
arXiv:2604.03515v2 Announce Type: replace-cross Abstract: LLM-based coding agents can localize bugs, generate patches, and run tests with diminishing human oversight, yet the scaffolding code that surrounds the language model (the control loop, tool definitions, state management, and context strategy) remains poorly understood. Existing surveys classify agents by abstract capabilities (tool use, planning, reflection) that cannot […]
Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning
arXiv:2604.11462v1 Announce Type: new Abstract: Large Language Models (LLMs) struggle with long-horizon tasks due to the “context bottleneck” and the “lost-in-the-middle” phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized […]
Tail-Aware Information-Theoretic Generalization for RLHF and SGLD
arXiv:2604.10727v1 Announce Type: cross Abstract: Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many modern pipelines, such as robust learning, RLHF, and stochastic optimization, losses and rewards can be heavy-tailed, and MGFs may not exist, […]
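For context, the classical KL-based bound that this abstract contrasts against (in the Xu-Raginsky style) controls the expected generalization gap via the mutual information between the learned weights $W$ and the $n$-sample training set $S$, with $L_\mu$ the population risk and $L_S$ the empirical risk; crucially, it requires the loss to be $\sigma$-sub-Gaussian, i.e. to have a finite MGF:

```latex
% Classical information-theoretic generalization bound,
% valid when \ell(w, Z) is \sigma-sub-Gaussian under the data distribution:
\mathbb{E}\left[ L_\mu(W) - L_S(W) \right]
  \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(W; S)}
```

Heavy-tailed losses break the sub-Gaussian MGF assumption on the left side of this inequality, which is the failure mode the paper's tail-aware bounds are designed to address.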
PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints
arXiv:2604.11523v1 Announce Type: new Abstract: We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present PAC-Bench, a benchmark for systematic evaluation of multi-agent collaboration under privacy […]

Zero-Shot Quantization via Weight-Space Arithmetic
arXiv:2604.03420v2 Announce Type: replace-cross Abstract: We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points […]
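Weight-space arithmetic of the kind this abstract describes generally follows the task-vector pattern: subtract a base checkpoint from a quantization-robust donor checkpoint, then add the resulting direction to a receiver model. A hedged sketch (the extract/patch recipe and the scaling factor are assumptions for illustration, not the paper's exact procedure):

```python
import numpy as np

def extract_vector(donor_robust, donor_base):
    """Quantization vector: elementwise difference between a
    quantization-robust donor checkpoint and its base weights."""
    return {k: donor_robust[k] - donor_base[k] for k in donor_base}

def patch(receiver, qvec, alpha=1.0):
    """Patch a receiver model by adding the scaled vector to
    each matching weight tensor."""
    return {k: receiver[k] + alpha * qvec.get(k, 0.0) for k in receiver}

# toy example: one weight tensor per "model"
base = {"w": np.zeros(3)}
robust = {"w": np.array([0.1, -0.2, 0.3])}
receiver = {"w": np.ones(3)}

qvec = extract_vector(robust, base)
patched = patch(receiver, qvec)
print(patched["w"])  # prints [1.1 0.8 1.3]
```

The appeal of this pattern is that it is training-free on the receiver side: the direction is computed once from the donor pair and applied as a single addition before post-training quantization.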
Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models
arXiv:2604.10681v1 Announce Type: cross Abstract: Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, […]
Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization
arXiv:2604.10721v1 Announce Type: cross Abstract: Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning […]
Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation
arXiv:2604.10511v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness — whether the empirical […]