Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

arXiv:2605.27155v1 Announce Type: cross Abstract: Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

arXiv:2605.27016v1 Announce Type: cross Abstract: Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment.

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

arXiv:2601.21972v5 Announce Type: replace Abstract: Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined

A Physics-Informed Hierarchical Neural Network for Microwave Scattering Analysis of 3D PEC Targets

arXiv:2508.03774v5 Announce Type: replace-cross Abstract: Accurate modeling of scattering from three-dimensional (3D) perfectly electrically conducting (PEC) targets at microwave frequencies constitutes a fundamental objective in

Experiments in Agentic AI for Science

arXiv:2605.26305v1 Announce Type: new Abstract: This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local