arXiv:2603.29450v1 Announce Type: cross Abstract: While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either […]
Grokking From Abstraction to Intelligence
arXiv:2603.29262v1 Announce Type: new Abstract: Grokking in modular arithmetic has established itself as the quintessential fruit fly experiment, serving as a critical domain for investigating the mechanistic origins of model generalization. Despite its significance, existing research remains narrowly focused on specific local circuits or optimization tuning, largely overlooking the global structural evolution that fundamentally drives […]
Baby Scale: Investigating Models Trained on Individual Children’s Language Input
arXiv:2603.29522v1 Announce Type: cross Abstract: Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this “data gap” requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from […]
PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
arXiv:2603.29318v1 Announce Type: new Abstract: Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately […]
Reducing Complexity for Quantum Approaches in Train Load Optimization
arXiv:2603.29543v1 Announce Type: cross Abstract: Efficiently planning container loads onto trains is a computationally challenging combinatorial optimization problem, central to logistics and supply chain management. A primary source of this complexity arises from the need to model and reduce rehandle operations-unproductive crane moves required to access blocked containers. Conventional mathematical formulations address this by introducing […]
Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models
arXiv:2603.29661v1 Announce Type: cross Abstract: Existing narrative extraction methods face a trade-off between coherence, interactivity, and multi-storyline support. Narrative Maps supports rich interaction and generates multiple storylines as a byproduct of its coverage constraints, though this comes at the cost of individual path coherence. Narrative Trails achieves high coherence through maximum capacity path optimization but […]
Nomad: Autonomous Exploration and Discovery
arXiv:2603.29353v1 Announce Type: new Abstract: We introduce Nomad, a system for autonomous data exploration and insight discovery. Given a corpus of documents, databases, or other data sources, users rarely know the full set of questions, hypotheses, or connections that could be explored. As a result, query-driven question answering and prompt-driven deep-research systems remain limited by […]
MacTok: Robust Continuous Tokenization for Image Generation
arXiv:2603.29634v1 Announce Type: cross Abstract: Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we […]
BenchScope: How Many Independent Signals Does Your Benchmark Provide?
arXiv:2603.29357v1 Announce Type: new Abstract: AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than […]
Generative AI in Action: Field Experimental Evidence from Alibaba’s Customer Service Operations
arXiv:2603.29888v1 Announce Type: cross Abstract: In collaboration with Alibaba, this study leverages a large-scale field experiment to assess the impact of a generative AI assistant on worker performance in e-commerce after-sales service. Human agents providing digital chat support were randomly assigned with access to a gen AI assistant that offered two core functions: diagnosis of […]
Retrospective Economic Evaluation of Group Testing in the COVID-19 Pandemic
arXiv:2603.28930v1 Announce Type: cross Abstract: Surveillance of diseases in a pandemic is an important part of public health policy. Diagnostic testing at the individual level is often infeasible due to resource constraints. To circumvent these constraints, group testing can be applied. The economic cost evaluation from the payer’s perspective typically focuses only on deterministic costs […]
Rigorous Explanations for Tree Ensembles
arXiv:2603.29361v1 Announce Type: new Abstract: Tree ensembles (TEs) find a multitude of practical applications. They represent one of the most general and accurate classes of machine learning methods. While they are typically quite concise in representation, their operation remains inscrutable to human decision makers. One solution to build trust in the operation of TEs is […]