Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

arXiv:2606.10107v1 Announce Type: cross Abstract: Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or […]

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

arXiv:2606.10159v1 Announce Type: cross Abstract: AI is increasingly used to support scientific peer review, from manuscript screening, reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: […]

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv:2606.10254v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in emphsolving high-school mathematics, their ability to emphevaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce textbfRealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals […]

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

arXiv:2604.14397v2 Announce Type: replace-cross Abstract: We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns […]

YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

arXiv:2606.10244v1 Announce Type: cross Abstract: We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, […]

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

arXiv:2606.10279v1 Announce Type: new Abstract: Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer’s disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment […]

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

arXiv:2606.10315v1 Announce Type: cross Abstract: LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript […]

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

arXiv:2606.08674v2 Announce Type: replace-cross Abstract: Existing video generation frameworks treat sequence duration as an externally prescribed parameter — fixed frame counts or text prompts — producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals […]

A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

arXiv:2606.10366v1 Announce Type: cross Abstract: Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. […]

Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

arXiv:2606.10286v1 Announce Type: new Abstract: Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed-Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational complexity and inability to adapt in real time limit its practical deployment in dynamic industrial environments. This work introduces a simulator-driven […]

NOVA: Symbolic Regression Discovery of Interpretable Car-Following and Lane-Change Models with Driver Heterogeneity

arXiv:2606.10583v1 Announce Type: cross Abstract: We present NOVA, an autonomous symbolic regression framework that identifies interpretable car-following and lane-change structures from raw trajectory data with minimal behavioral priors. Applied to 4,765,788 active driving observations from the NGSIM I-80 and US-101 datasets, NOVA’s deterministic Rust-powered search engine evaluates over 10,000 candidate algebraic structures and identifies a […]

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

arXiv:2606.10298v1 Announce Type: new Abstract: When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a emphcontext-aware paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the […]

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844