arXiv:2604.26024v1 Announce Type: cross Abstract: Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on average to fail on specific subpopulations. Prior work has shown that common evaluation measures for imbalanced classification are biased toward larger minority subconcepts and that utility-based reweighting using true subconcept labels […]
Evaluating the Alignment Between GeoAI Explanations and Domain Knowledge in Satellite-Based Flood Mapping
arXiv:2604.26051v1 Announce Type: cross Abstract: The increasing number of satellites has improved the temporal resolution of Earth observation, making satellite-based flood mapping a promising approach for operational flood monitoring. Deep learning-based approaches for flood mapping using satellite imagery, an important application within Geospatial Artificial Intelligence (GeoAI), have shown improved predictive performance by learning complex spatial […]
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
arXiv:2604.16552v2 Announce Type: replace-cross Abstract: Recent text-to-scene generation approaches have largely reduced the manual effort required to create 3D scenes. However, they focus on generating either a scene layout or objects, and few generate both. The generated scene layout is often simple even with an LLM’s help. Moreover, the generated scene is often inconsistent […]
Assessing the Utility of Volumetric Motion Fields for Radar-based Precipitation Nowcasting with Physics-informed Deep Learning
arXiv:2603.13589v2 Announce Type: replace-cross Abstract: Estimating motion from spatiotemporal geoscientific data is a fundamental component of many environmental modeling and forecasting tasks. In this work, we propose a physics-informed deep learning framework for estimating altitude-wise motion fields directly from volumetric radar reflectivity data. The model utilizes a fully differentiable semi-Lagrangian extrapolation operator to process three-dimensional […]
q3-MuPa: Quick, Quiet, Quantitative Multi-Parametric MRI using Physics-Informed Diffusion Models
arXiv:2512.23726v2 Announce Type: replace-cross Abstract: The 3D fast silent multi-parametric mapping sequence with zero echo time (MuPa-ZTE) is a novel quantitative MRI (qMRI) acquisition that enables nearly silent scanning by using a 3D phyllotaxis sampling scheme. MuPa-ZTE improves patient comfort and motion robustness, and generates quantitative maps of T1, T2, and proton density using the […]
Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
arXiv:2508.12672v4 Announce Type: replace-cross Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses […]
M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
arXiv:2504.18662v3 Announce Type: replace-cross Abstract: Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such […]
A Framework for Longitudinal Health AI Agents
arXiv:2604.12019v3 Announce Type: replace Abstract: Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, both within and beyond clinical settings, […]
ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents
arXiv:2412.13682v5 Announce Type: replace Abstract: Travel planning stands out among real-world applications of Language Agents because it couples significant practical demand with a rigorous constraint-satisfaction challenge. However, existing benchmarks primarily operate on a slot-filling paradigm, restricting agents to synthetic queries with pre-defined constraint menus, which fails to capture the open-ended nature of natural language interaction, […]
Student Guides Teacher: Weak-to-Strong Inference via Spectral Orthogonal Exploration
arXiv:2601.06160v2 Announce Type: replace Abstract: Large Language Models (LLMs) often suffer from “Reasoning Collapse” on challenging mathematical reasoning tasks, where stochastic sampling produces lexical variations of the same erroneous logic rather than genuine semantic exploration. We observe that failed reasoning traces are often associated with a low-rank bias manifold in the model’s hidden-state geometry, which […]
Identifying the Achilles’ Heel: An Iterative Method for Dynamically Uncovering Factual Errors in Large Language Models
arXiv:2401.00761v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs’ veracity […]
MINOS: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text
arXiv:2506.02494v2 Announce Type: replace-cross Abstract: Evaluation is important for multimodal generation tasks, while traditional multimodal evaluation metrics suffer from several limitations. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing research often simply collects large-scale evaluation data for training, while overlooking the quality of […]