arXiv:2510.26099v1 Announce Type: cross Abstract: The dominant paradigm in machine learning is to assess model performance based on average loss across all samples in some test set. This amounts to averaging performance geospatially across the Earth in weather and climate settings, failing to account for the non-uniform distribution of human development and geography. We introduce […]
Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking
arXiv:2409.06263v2 Announce Type: replace-cross Abstract: Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities […]
WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
arXiv:2510.26125v1 Announce Type: cross Abstract: Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall […]
Estimating cognitive biases with attention-aware inverse planning
arXiv:2510.25951v1 Announce Type: new Abstract: People’s goal-directed behaviors are influenced by their cognitive biases, and autonomous systems that interact with people should be aware of this. For example, people’s attention to objects in their environment will be biased in a way that systematically affects how they perform everyday tasks such as driving to work. Here, […]
Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series
arXiv:2510.26159v1 Announce Type: cross Abstract: In this study, we investigate the effectiveness of advanced feature engineering and hybrid model architectures for anomaly detection in a multivariate industrial time series, focusing on a steam turbine system. We evaluate the impact of change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies on detection performance. Despite […]
Empowering Agentic Video Analytics Systems with Video Language Models
arXiv:2505.00254v4 Announce Type: replace-cross Abstract: AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, […]
ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts
arXiv:2510.26186v1 Announce Type: cross Abstract: Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained […]
From Queries to Insights: Agentic LLM Pipelines for Spatio-Temporal Text-to-SQL
arXiv:2510.25997v1 Announce Type: new Abstract: Natural-language-to-SQL (NL-to-SQL) systems hold promise for democratizing access to structured data, allowing users to query databases without learning SQL. Yet existing systems struggle with realistic spatio-temporal queries, where success requires aligning vague user phrasing with schema-specific categories, handling temporal reasoning, and choosing appropriate outputs. We present an agentic pipeline that […]
Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
arXiv:2510.26205v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. […]
GenIR: Generative Visual Feedback for Mental Image Retrieval
arXiv:2506.06220v2 Announce Type: replace-cross Abstract: Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind. That is, a mental image ranging from […]