arXiv:2605.18840v1 Announce Type: cross Abstract: Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual ($h$-field) that diagnoses […]
TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection
arXiv:2605.19738v1 Announce Type: cross Abstract: Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders […]
Transformers Linearly Represent Highly Structured World Models
arXiv:2605.18847v1 Announce Type: cross Abstract: Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We […]
How Far Are We From True Auto-Research?
arXiv:2605.19156v1 Announce Type: new Abstract: Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi […]
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
arXiv:2605.18852v1 Announce Type: cross Abstract: Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this […]
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
arXiv:2605.19846v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, […]
Towards Family-Grouped Hierarchical Federated Learning on Sub-5KB Models: A Feasibility Study of Privacy-Preserving ECG Monitoring for Ultra-Resource-Constrained Wearables
arXiv:2605.18862v1 Announce Type: cross Abstract: Cardiovascular disease remains the leading cause of death worldwide, and early detection of arrhythmias through continuous ECG monitoring on wearable devices can prevent life-threatening events. Federated Learning (FL) enables privacy-preserving collaborative training by keeping raw ECG data on device, yet standard FL incurs prohibitive communication overhead and standard deep learning […]
A putative model of the gut-muscle axis in aged livestock
arXiv:2605.19171v1 Announce Type: new Abstract: The gut-muscle axis has been proposed to link gut microbiota with skeletal muscle physiology, yet its universality across livestock species remains unclear. Using aged laying hens, a livestock model with a relatively short digestive tract, we examined the gut microbiota, faecal metabolome, and breast-muscle metabolome by integrative multi-omics analyses in […]
EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample
arXiv:2605.18867v1 Announce Type: cross Abstract: Test-time model evolution offers a promising way for deployed models to improve from unlabeled test-time experience, yet most existing methods depend on backpropagation (BP), which incurs substantial memory overhead and makes them difficult to deploy on edge devices, quantized models, specialized accelerators, or black-box models. In this work, we study […]
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL
arXiv:2605.19988v1 Announce Type: cross Abstract: Documentation has long guided computer system tuning by distilling expert knowledge into per-parameter recommendations. Yet such guides capture only what experts conclude, discarding how they reason. This fundamental gap manifests in three concrete deficiencies: documentation grows stale as software evolves, fails under heterogeneous workloads, and ignores inter-parameter dependencies. We propose […]
EUPHORIA: Efficient Universal Planning via Hybrid Optimization for Robust Industrial Robotic Assembly
arXiv:2605.18872v1 Announce Type: cross Abstract: Robotic assembly in architectural construction faces a persistent bottleneck: existing planners are either highly specialized, requiring prohibitive retraining for every new geometric design, or operationally inefficient, treating structural sequencing and kinematic motion as disjoint processes. We present EUPHORIA, a unified framework that achieves universal few-shot adaptability and dynamic efficiency through […]
Discoverable Agent Knowledge — A Formal Framework for Agentic KG Affordances (Extended Version)
arXiv:2605.19186v1 Announce Type: new Abstract: Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, and invoke web services coherently. The response was OWL-S and WSMO: formally grounded capability descriptions specifying what a service could do, what the agent must already know for invocation to be […]