arXiv:2605.19407v1 Announce Type: cross Abstract: We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data […]
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
arXiv:2605.18740v2 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting […]
Features have life history. And we should care
arXiv:2605.18789v1 Announce Type: new Abstract: Features in language models have life history: they emerge, persist, and die during training, yet the importance of that history remains largely unexplored. We find evidence of a persistent representational backbone, which we identify in Pythia-160M and -410M as the carrier scaffold: $\sim 50$ sparse features with stable life histories, around […]
Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
arXiv:2605.19604v1 Announce Type: new Abstract: Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual […]
EmbGen: Teaching with Reassembled Corpora
arXiv:2605.19394v1 Announce Type: cross Abstract: Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not […]
Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models
arXiv:2605.19663v1 Announce Type: new Abstract: Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended […]
Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization
arXiv:2605.17839v2 Announce Type: replace-cross Abstract: Knowledge distillation transfers knowledge from a high-capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between hard and soft losses makes the learning process brittle. Recent studies try to reweight these components in long-tailed settings. However, most of […]
GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction
arXiv:2605.19765v1 Announce Type: new Abstract: Existing affective-computing, social-signal-processing, and meeting corpora capture important parts of human interaction, but they rarely support analysis of affect in co-located groups as a coupled individual, interpersonal, and group-level process. The required signals (per-participant physiology, eye movement, audio, self-report, task outcomes, and personality) are usually fragmented across separate dataset traditions. […]
The Evaluation Game: Beyond Static LLM Benchmarking
arXiv:2605.19377v1 Announce Type: cross Abstract: As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the […]
From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning
arXiv:2605.19824v1 Announce Type: new Abstract: Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and […]
1GC-7RC: One Graphic Card — Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
arXiv:2605.17046v2 Announce Type: replace-cross Abstract: Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a […]
When Skills Don’t Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity
arXiv:2605.20023v1 Announce Type: new Abstract: Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are […]