arXiv:2603.13428v1 Announce Type: cross Abstract: With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this […]
Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains
arXiv:2603.14400v1 Announce Type: cross Abstract: The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model […]
Post Training Quantization for Efficient Dataset Condensation
arXiv:2603.13346v1 Announce Type: cross Abstract: Dataset Condensation (DC) distills knowledge from large datasets into smaller ones, accelerating training and reducing storage requirements. However, despite notable progress, prior methods have largely overlooked the potential of quantization for further reducing storage costs. In this paper, we take the first step to explore post-training quantization in dataset condensation, […]
Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation
arXiv:2603.13891v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used for automated text annotation in tasks ranging from academic research to content moderation and hiring. Across 19 LLMs and two experiments totaling more than 4 million annotation judgments, we show that subtle identity cues embedded in text systematically bias annotation outcomes in ways […]
Artificial intelligence-enabled single-lead ECG for non-invasive hyperkalemia detection: development, multicenter validation, and proof-of-concept deployment
arXiv:2603.14177v1 Announce Type: cross Abstract: Hyperkalemia is a life-threatening electrolyte disorder that is common in patients with chronic kidney disease and heart failure, yet frequent monitoring remains difficult outside hospital settings. We developed and validated Pocket-K, a single-lead AI-ECG system initialized from the ECGFounder foundation model for non-invasive hyperkalemia screening and handheld deployment. In this […]
Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation
arXiv:2603.12983v2 Announce Type: replace-cross Abstract: Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel […]
Anterior’s Approach to Fairness Evaluation of Automated Prior Authorization System
arXiv:2603.14631v1 Announce Type: cross Abstract: Growing staffing constraints and turnaround-time pressures in prior authorization (PA) have led to increasing automation of decision systems to support PA review. Evaluating fairness in such systems poses unique challenges because legitimate clinical guidelines and medical necessity criteria often differ across demographic groups, making parity in approval rates an inappropriate […]
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
arXiv:2603.13228v2 Announce Type: replace-cross Abstract: Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models for character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable […]
A Methodology for Thermal Limit Bias Predictability Through Artificial Intelligence
arXiv:2603.14648v1 Announce Type: cross Abstract: Nuclear power plant operators face significant challenges due to unpredictable deviations between offline and online thermal limits, a phenomenon known as thermal limit bias, which leads to conservative design margins, increased fuel costs, and operational inefficiencies. This work presents a deep-learning-based methodology to predict and correct this bias […]
Human Attribution of Causality to AI Across Agency, Misuse, and Misalignment
arXiv:2603.13236v1 Announce Type: new Abstract: AI-related incidents are becoming increasingly frequent and severe, ranging from safety failures to misuse by malicious actors. In such complex situations, identifying which elements caused an adverse outcome, the problem of cause selection, is a critical first step for establishing liability. This paper investigates folk perceptions of causal responsibility in […]
Revisiting Model Stitching In the Foundation Model Era
arXiv:2603.12433v2 Announce Type: replace-cross Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit […]
Compute Allocation for Reasoning-Intensive Retrieval Agents
arXiv:2603.14635v1 Announce Type: cross Abstract: As agents operate over long horizons, their memory stores grow continuously, making retrieval critical to accessing relevant information. Many agent queries require reasoning-intensive retrieval, where the connection between query and relevant documents is implicit and requires inference to bridge. LLM-augmented pipelines address this through query expansion and candidate re-ranking, but […]