arXiv:2606.04922v1 Announce Type: cross Abstract: Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision […]
memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations
arXiv:2606.01138v2 Announce Type: replace-cross Abstract: Agent-memory frameworks — mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor — each ship their own SDK, storage layout, and operational vocabulary. There is no shared wire format: every integration is bespoke, every migration rebuilds memory from scratch, and no framework ships a governance surface that lets a human review writes before […]
Invariant Gradient Alignment for Robust Reasoning Distillation
arXiv:2606.05025v1 Announce Type: cross Abstract: Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermines knowledge distillation pipelines that transfer chain-of-thought reasoning to smaller students. We introduce Invariant Gradient Alignment (IGA), a training framework […]
‘Your AI Text is not Mine’: Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions
arXiv:2606.04906v1 Announce Type: cross Abstract: Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature on what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and often only loosely […]
Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
arXiv:2506.10912v4 Announce Type: replace Abstract: Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark […]
SciDER: Scientific Data-centric End-to-end Researcher
arXiv:2603.01421v3 Announce Type: replace Abstract: While large language models accelerate scientific discovery, existing agents face severe limitations in adaptability, domain generalization, and multimodal scalability, often struggling to autonomously process raw, domain-specific experimental data. To overcome these barriers, we introduce SciDER, a multi-agent system designed to flexibly automate the entire research lifecycle. This framework employs a […]
Provably Auditable and Safe LLM Agents from Human-Authored Ontologies
arXiv:2606.04903v1 Announce Type: cross Abstract: We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present […]
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
arXiv:2606.01961v2 Announce Type: replace Abstract: Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark […]
AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis
arXiv:2606.02403v2 Announce Type: replace-cross Abstract: Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragmented and labour-intensive process. Researchers must interpret complex clinical texts, manually extract outcome data from trials, define appropriate interventions and comparators, harmonise inconsistent study designs, and carry out meta-analytic computations, typically using specialised […]
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
arXiv:2506.05233v2 Announce Type: replace-cross Abstract: Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as […]
DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance
arXiv:2606.04881v1 Announce Type: cross Abstract: Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient […]
OckBench: Measuring the Efficiency of LLM Reasoning
arXiv:2511.05722v3 Announce Type: replace-cross Abstract: Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage. Token efficiency is highly variable in practice. Models solving the same problem with […]