arXiv:2603.28407v1 Announce Type: new Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, […]
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
arXiv:2603.26996v1 Announce Type: new Abstract: We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced […]
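The task format described in the abstract can be illustrated with a toy item (hypothetical, far simpler than the graduate-level problems the benchmark targets): a natural-language claim paired with a Lean 4 statement that the model must close with a proof the checker accepts.

```lean
-- Hypothetical toy item: "addition of natural numbers is commutative."
-- Shown only to illustrate the statement/proof format, not benchmark difficulty.
theorem toy_item (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A submission is scored by whether the Lean 4 checker accepts the proof term, not by resemblance to a reference proof.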
Global stability and uniform persistence in an epidemic model with saturating fomite-mediated transmission
arXiv:2603.28285v1 Announce Type: cross Abstract: We analyse the global dynamics of a Susceptible–Vaccinated–Exposed–Infected–Recovered (SVEIR) epidemic model with demographic turnover, imperfect vaccination, and two transmission routes: direct host-to-host contagion and indirect transmission via contaminated fomites. Indirect transmission is described through an environmental pathogen concentration and a Holling-type dose–response function, accounting for nonlinear incidence at high contamination […]
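The saturating indirect route described above is typically written with a Holling type II (Michaelis–Menten) dose–response. A minimal sketch, with hypothetical symbols that need not match the paper's notation ($W$ the environmental pathogen concentration, $I$ the infected class):

```latex
% Indirect (fomite-mediated) force of infection, saturating in W:
\lambda_W(W) = \beta_W \,\frac{W}{K + W},
% coupled to environmental pathogen shedding and decay:
\frac{dW}{dt} = \xi I - \delta W .
```

Under this form, incidence grows nearly linearly in $W$ at low contamination and the per-susceptible hazard saturates at $\beta_W$ as $W \to \infty$, which is the nonlinear high-contamination behaviour the abstract refers to.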
Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning
arXiv:2603.28651v1 Announce Type: new Abstract: With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on academic paper reasoning is largely confined […]
Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems
arXiv:2603.28561v1 Announce Type: cross Abstract: The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language […]
Can AI be a Teaching Partner? Evaluating ChatGPT, Gemini, and DeepSeek across Three Teaching Strategies
arXiv:2603.26673v1 Announce Type: cross Abstract: There are growing promises that Large Language Models (LLMs) can support students’ learning by providing explanations, feedback, and guidance. However, despite their rapid adoption and widespread attention, there is still limited empirical evidence regarding the pedagogical skills of LLMs. This article presents a comparative study of popular LLMs, namely, ChatGPT, […]
Limits of Imagery Reasoning in Frontier LLM Models
arXiv:2603.26779v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external “Imagery Module” — a tool capable of rendering and rotating 3D models — can bridge this gap, […]
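The "Imagery Module" is described only as a tool that renders and rotates 3D models; as a minimal sketch of the core operation such a tool would expose (the function name and interface here are assumptions, not the paper's API), a rotation of a 3D point set about one axis:

```python
import math

def rotate_z(points, theta):
    """Rotate 3-D points about the z-axis by angle theta (radians).
    A toy stand-in for the external rendering/rotation tool the
    abstract describes; a real module would also rasterize the result."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

# A two-point "shape" rotated by 90 degrees.
shape = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
rotated = rotate_z(shape, math.pi / 2)
```

A mental-rotation query can then be reduced to rotating one candidate shape and checking point-set congruence with the target, which is exactly the simulation step LLMs are reported to struggle with unaided.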
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
arXiv:2602.16898v4 Announce Type: replace-cross Abstract: Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language […]
When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring
arXiv:2603.27076v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner’s current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with […]
Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
arXiv:2603.27343v1 Announce Type: new Abstract: Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, […]
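A no-scratchpad probe of cumulative arithmetic state tracking, as the abstract describes it, can be sketched as a generator of update chains whose final value the model must report without intermediate workings. This is a hypothetical illustration of the probe style, not the WMF-AM construction itself:

```python
import random

def make_state_tracking_probe(n_steps=8, seed=0):
    """Generate a chain of additive updates and the ground-truth final
    state. Illustrative only: the actual WMF-AM probe design and
    calibration are not specified in the excerpt."""
    rng = random.Random(seed)
    total, steps = 0, []
    for _ in range(n_steps):
        delta = rng.randint(-9, 9)
        total += delta
        steps.append(f"add {delta}" if delta >= 0 else f"subtract {abs(delta)}")
    prompt = "Start at 0. " + ", then ".join(steps) + ". What is the result?"
    return prompt, total

prompt, answer = make_state_tracking_probe(n_steps=5, seed=1)
```

Scoring the model's single-number reply against `answer`, with no scratchpad allowed, isolates internal state tracking from task-completion heuristics.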
Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
arXiv:2603.26769v1 Announce Type: cross Abstract: The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category […]
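The 4-bit NF4 setting mentioned above quantizes weights blockwise against a per-block absolute maximum. As a sketch of the mechanism, here is blockwise 4-bit quantization with a uniform codebook (hypothetical simplification: real NF4 uses a non-uniform codebook of normal-distribution quantiles):

```python
def quantize_block(weights, levels=16):
    """Map a block of weights to 4-bit codes (0..15) after scaling
    by the block's absolute maximum. Uniform codebook for illustration;
    NF4 proper uses non-uniform levels."""
    absmax = max(abs(w) for w in weights) or 1.0
    step = 2.0 / (levels - 1)  # codes span [-1, 1]
    codes = [round((w / absmax + 1.0) / step) for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, levels=16):
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]

w = [0.5, -0.25, 0.1, -0.9]
codes, scale = quantize_block(w)
w_hat = dequantize_block(codes, scale)
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The per-block reconstruction error is bounded by half a quantization step times the block scale; corruption-induced activation shifts interact with exactly this rounding, which is one mechanism by which compressed models can fail differently rather than merely more often.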
From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics
arXiv:2603.26772v1 Announce Type: cross Abstract: Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This […]