arXiv:2603.28407v1 Announce Type: new Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, […]
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
arXiv:2603.26996v1 Announce Type: new Abstract: We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced […]
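The task format described in the abstract can be illustrated with a toy item (hypothetical, far simpler than the graduate-level problems the benchmark targets): a natural-language claim paired with a Lean 4 statement that the model must close with a proof the checker accepts.

```lean
-- Hypothetical toy item: "addition of natural numbers is commutative."
-- Shown only to illustrate the statement/proof format, not benchmark difficulty.
theorem toy_item (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A submission is scored by whether the Lean 4 checker accepts the proof term, not by resemblance to a reference proof.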
Global stability and uniform persistence in an epidemic model with saturating fomite-mediated transmission
arXiv:2603.28285v1 Announce Type: cross Abstract: We analyse the global dynamics of a Susceptible–Vaccinated–Exposed–Infected–Recovered (SVEIR) epidemic model with demographic turnover, imperfect vaccination, and two transmission routes: direct host-to-host contagion and indirect transmission via contaminated fomites. Indirect transmission is described through an environmental pathogen concentration and a Holling-type dose–response function, accounting for nonlinear incidence at high contamination […]
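The saturating indirect route described above is typically written with a Holling type II (Michaelis–Menten) dose–response. A minimal sketch, with hypothetical symbols that need not match the paper's notation ($W$ the environmental pathogen concentration, $I$ the infected class):

```latex
% Indirect (fomite-mediated) force of infection, saturating in W:
\lambda_W(W) = \beta_W \,\frac{W}{K + W},
% coupled to environmental pathogen shedding and decay:
\frac{dW}{dt} = \xi I - \delta W .
```

Under this form, incidence grows nearly linearly in $W$ at low contamination and the per-susceptible hazard saturates at $\beta_W$ as $W \to \infty$, which is the nonlinear high-contamination behaviour the abstract refers to.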
Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning
arXiv:2603.28651v1 Announce Type: new Abstract: With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on academic paper reasoning is largely confined […]
Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems
arXiv:2603.28561v1 Announce Type: cross Abstract: The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language […]
Can AI be a Teaching Partner? Evaluating ChatGPT, Gemini, and DeepSeek across Three Teaching Strategies
arXiv:2603.26673v1 Announce Type: cross Abstract: There are growing promises that Large Language Models (LLMs) can support students’ learning by providing explanations, feedback, and guidance. However, despite their rapid adoption and widespread attention, there is still limited empirical evidence regarding the pedagogical skills of LLMs. This article presents a comparative study of popular LLMs, namely, ChatGPT, […]
Limits of Imagery Reasoning in Frontier LLM Models
arXiv:2603.26779v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external “Imagery Module” — a tool capable of rendering and rotating 3D models — can bridge this gap, […]
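The "Imagery Module" is described only as a tool that renders and rotates 3D models; as a minimal sketch of the core operation such a tool would expose (the function name and interface here are assumptions, not the paper's API), a rotation of a 3D point set about one axis:

```python
import math

def rotate_z(points, theta):
    """Rotate 3-D points about the z-axis by angle theta (radians).
    A toy stand-in for the external rendering/rotation tool the
    abstract describes; a real module would also rasterize the result."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

# A two-point "shape" rotated by 90 degrees.
shape = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
rotated = rotate_z(shape, math.pi / 2)
```

A mental-rotation query can then be reduced to rotating one candidate shape and checking point-set congruence with the target, which is exactly the simulation step LLMs are reported to struggle with unaided.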
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
arXiv:2602.16898v4 Announce Type: replace-cross Abstract: Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language […]
When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring
arXiv:2603.27076v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner’s current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with […]
Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
arXiv:2603.27343v1 Announce Type: new Abstract: Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, […]
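A no-scratchpad probe of cumulative arithmetic state tracking, as the abstract describes it, can be sketched as a generator of update chains whose final value the model must report without intermediate workings. This is a hypothetical illustration of the probe style, not the WMF-AM construction itself:

```python
import random

def make_state_tracking_probe(n_steps=8, seed=0):
    """Generate a chain of additive updates and the ground-truth final
    state. Illustrative only: the actual WMF-AM probe design and
    calibration are not specified in the excerpt."""
    rng = random.Random(seed)
    total, steps = 0, []
    for _ in range(n_steps):
        delta = rng.randint(-9, 9)
        total += delta
        steps.append(f"add {delta}" if delta >= 0 else f"subtract {abs(delta)}")
    prompt = "Start at 0. " + ", then ".join(steps) + ". What is the result?"
    return prompt, total

prompt, answer = make_state_tracking_probe(n_steps=5, seed=1)
```

Scoring the model's single-number reply against `answer`, with no scratchpad allowed, isolates internal state tracking from task-completion heuristics.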
Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
arXiv:2603.26769v1 Announce Type: cross Abstract: The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category […]
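The 4-bit NF4 setting mentioned above quantizes weights blockwise against a per-block absolute maximum. As a sketch of the mechanism, here is blockwise 4-bit quantization with a uniform codebook (hypothetical simplification: real NF4 uses a non-uniform codebook of normal-distribution quantiles):

```python
def quantize_block(weights, levels=16):
    """Map a block of weights to 4-bit codes (0..15) after scaling
    by the block's absolute maximum. Uniform codebook for illustration;
    NF4 proper uses non-uniform levels."""
    absmax = max(abs(w) for w in weights) or 1.0
    step = 2.0 / (levels - 1)  # codes span [-1, 1]
    codes = [round((w / absmax + 1.0) / step) for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, levels=16):
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]

w = [0.5, -0.25, 0.1, -0.9]
codes, scale = quantize_block(w)
w_hat = dequantize_block(codes, scale)
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The per-block reconstruction error is bounded by half a quantization step times the block scale; corruption-induced activation shifts interact with exactly this rounding, which is one mechanism by which compressed models can fail differently rather than merely more often.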
From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics
arXiv:2603.26772v1 Announce Type: cross Abstract: Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This […]