arXiv:2602.17598v2 Announce Type: replace-cross Abstract: Speech LLMs are widely understood to be better than ASR$rightarrow$LLM cascades since they have access to the audio directly, and not just the transcript. In this paper, we present an evaluation methodology and a mechanistic interpretation of the observed behavior of speech LLMs. First, we introduce matched-backbone testing which separates […]
TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks
arXiv:2603.05764v1 Announce Type: cross Abstract: Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks. Practical value depends on end-to-end correctness and reliability under time limits. This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks. This paper evaluates 10 OSS LLMs on four Kaggle competitions and three time […]
The World Won’t Stay Still: Programmable Evolution for Agent Benchmarks
arXiv:2603.05910v1 Announce Type: new Abstract: LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents’ robustness to environmental changes. In this paper, we study a […]
PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models
arXiv:2603.05776v1 Announce Type: cross Abstract: Motivation: Patient-generated text contains critical information about patients’ lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable […]
Measuring AI R&D Automation
arXiv:2603.03992v3 Announce Type: replace-cross Abstract: The automation of AI R&D (AIRDA) could have significant implications, but its extent and ultimate effects remain uncertain. We need empirical data to resolve these uncertainties, but existing data (primarily capability benchmarks) may not reflect real-world automation or capture its broader consequences, such as whether AIRDA accelerates capabilities more than […]
Human-Data Interaction, Exploration, and Visualization in the AI Era: Challenges and Opportunities
arXiv:2603.05542v1 Announce Type: cross Abstract: The rapid advancement of AI is transforming human-centered systems, with profound implications for human-AI interaction, human-data interaction, and visual analytics. In the AI era, data analysis increasingly involves large-scale, heterogeneous, and multimodal data that is predominantly unstructured, as well as foundation models such as LLMs and VLMs, which introduce additional […]
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
arXiv:2603.05912v1 Announce Type: new Abstract: Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that […]
Towards Efficient and Stable Ocean State Forecasting: A Continuous-Time Koopman Approach
arXiv:2603.05560v1 Announce Type: cross Abstract: We investigate the Continuous-Time Koopman Autoencoder (CT-KAE) as a lightweight surrogate model for long-horizon ocean state forecasting in a two-layer quasi-geostrophic (QG) system. By projecting nonlinear dynamics into a latent space governed by a linear ordinary differential equation, the model enforces structured and interpretable temporal evolution while enabling temporally resolution-invariant […]
LiveSense: A Real-Time Wi-Fi Sensing Platform for Range-Doppler on COTS Laptop
arXiv:2603.06545v1 Announce Type: cross Abstract: We present LiveSense – a cross-platform that transforms a commercial off-the-shelf (COTS) Wi-Fi Network Interface Card (NIC) on a laptop into a centimeter-level Range-Doppler sensor while preserving simultaneous communication capability. The laptops are equipped with COTS Intel AX211 (Wi-Fi 6E) or Intel BE201 (Wi-Fi 7) NICs. LiveSense can (i) Extract […]
When AI Levels the Playing Field: Skill Homogenization, Asset Concentration, and Two Regimes of Inequality
arXiv:2603.05565v1 Announce Type: cross Abstract: Generative AI compresses within-task skill differences while shifting economic value toward concentrated complementary assets, creating an apparent paradox: the technology that equalizes individual performance may widen aggregate inequality. We formalize this tension in a task-based model with endogenous education, employer screening, and heterogeneous firms. The model yields two regimes whose […]
An Interactive Multi-Agent System for Evaluation of New Product Concepts
arXiv:2603.05980v1 Announce Type: new Abstract: Product concept evaluation is a critical stage that determines strategic resource allocation and project success in enterprises. However, traditional expert-led approaches face limitations such as subjective bias and high time and cost requirements. To support this process, this study proposes an automated approach utilizing a large language model (LLM)-based multi-agent […]
PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions
arXiv:2603.05574v1 Announce Type: cross Abstract: This paper presents PRISM: an instruction-conditioned refinement method for imitation policies in robotic manipulation. This approach bridges Imitation Learning (IL) and Reinforcement Learning (RL) frameworks into a seamless pipeline, such that an imitation policy on a broad generic task, generated from a set of user-guided demonstrations, can be refined through […]