arXiv:2604.04532v1 Announce Type: cross Abstract: Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge’s language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three […]
Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity
arXiv:2604.04692v1 Announce Type: cross Abstract: Automated fact-checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In […]
A Quantum Search Approach to Magic Square Constraint Problems with Classical Benchmarking
arXiv:2604.04786v1 Announce Type: cross Abstract: This paper presents a quantum search approach to combinatorial constraint satisfaction problems, demonstrated through the generation of magic squares. We reformulate magic square construction as a quantum search problem in which a reversible, constraint-sensitive oracle marks valid configurations for amplitude amplification via Grover’s algorithm. Classical pre-processing using the Siamese construction […]
Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics
arXiv:2510.09901v2 Announce Type: replace Abstract: Computing has long served as a cornerstone of scientific discovery. Recently, a paradigm shift has emerged with the rise of large language models (LLMs), introducing autonomous systems, referred to as agents, that accelerate discovery across varying levels of autonomy. These language agents provide a flexible and versatile framework that orchestrates […]
KLong: Training LLM Agent for Extremely Long-horizon Tasks
arXiv:2602.17547v2 Announce Type: replace Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we […]
RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
arXiv:2504.18594v3 Announce Type: replace-cross Abstract: Compared to untargeted attacks, targeted transfer-based attack is still suffering from much lower Attack Success Rates (ASRs), although significant improvements have been achieved by kinds of methods, such as diversifying input, stabilizing the gradient, and re-training surrogate models. In this paper, we find that adversarial examples generated by existing methods […]
CATNet: A geometric deep learning approach for CAT bond spread prediction in the primary market
arXiv:2508.10208v2 Announce Type: replace-cross Abstract: Traditional models for pricing catastrophe (CAT) bonds struggle to capture the complex, relational data inherent in these instruments. This paper introduces CATNet, a novel framework that applies a geometric deep learning architecture, the Relational Graph Convolutional Network (R-GCN), to model the CAT bond primary market as a graph, leveraging its […]
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
arXiv:2510.15148v2 Announce Type: replace-cross Abstract: Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal […]
Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
arXiv:2512.18388v2 Announce Type: replace-cross Abstract: Generative AI has democratized content creation, but popular chatbot-based interfaces often prioritize execution, generating fully rendered artifacts right away. This issue can lead to premature convergence and design fixation, where users are being anchored to initial outputs. Recent works have proposed new interfaces to address this issue by supporting exploration, […]
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
arXiv:2602.08392v2 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced the landscape of embodied AI, yet transitioning to synchronized bimanual coordination introduces formidable challenges in multi-stream multimodal integration. We introduce ST-BiBench, a comprehensive multi-tier framework for evaluating spatio-temporal multimodal coordination. Our approach centers on Strategic Coordination Planning, assessing high-level cross-modal reasoning over […]
UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
arXiv:2603.11583v3 Announce Type: replace-cross Abstract: The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct […]
Toward Evaluation Frameworks for Multi-Agent Scientific AI Systems
arXiv:2603.26718v2 Announce Type: replace-cross Abstract: We analyze the challenges of benchmarking scientific (multi)-agentic systems, including the difficulty of distinguishing reasoning from retrieval, the risks of data/model contamination, the lack of reliable ground truth for novel research problems, the complications introduced by tool use, and the replication challenges due to the continuously changing/updating knowledge base. We […]