Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

arXiv:2605.23595v3 Announce Type: replace-cross Abstract: The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to […]

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

arXiv:2606.06915v2 Announce Type: replace-cross Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens […]
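
As an illustration of the "multi-sample generation and verifier-based reranking" pattern named in this abstract, here is a minimal best-of-N sketch. It is not ThinkBooster's API: `best_of_n`, `generate`, and `score` are hypothetical names, and the toy generator and scorer stand in for an LLM sampler and a reasoning verifier.

```python
import random


def best_of_n(prompt, generate, score, n=4, seed=0):
    """Sample n candidates and return the one the scorer ranks highest.

    `generate(prompt, rng)` and `score(prompt, candidate)` are hypothetical
    callables supplied by the caller; they are assumptions of this sketch.
    """
    rng = random.Random(seed)
    # Spend extra inference compute: draw several candidates per prompt.
    candidates = [generate(prompt, rng) for _ in range(n)]
    # Rerank: score every candidate with the verifier and keep the best.
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins: the "model" guesses an integer, the "verifier" prefers
# guesses close to 10.
def toy_generate(prompt, rng):
    return rng.randint(0, 20)


def toy_score(prompt, candidate):
    return -abs(candidate - 10)


answer, verifier_score = best_of_n("guess the number", toy_generate, toy_score, n=8)
```

Raising `n` is the knob that trades compute for quality in this sketch; the returned candidate can only be as good as the scorer's ranking.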

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

arXiv:2606.09012v1 Announce Type: cross Abstract: Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the […]
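
To make the PTQ side of this abstract concrete, the sketch below applies plain round-to-nearest symmetric quantization to a handful of weights and compares reconstruction error at two bitwidths. This is a generic textbook scheme chosen for illustration, not the paper's method; the function names and the example weights are assumptions.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer level and clamp to the range.
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return levels, scale


def dequantize(levels, scale):
    """Map integer levels back to floating-point weights."""
    return [q * scale for q in levels]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.9, -0.4, 0.05, 0.31]  # illustrative values only

levels_8, scale_8 = quantize_symmetric(weights, bits=8)
levels_2, scale_2 = quantize_symmetric(weights, bits=2)

err_8 = mean_squared_error(weights, dequantize(levels_8, scale_8))
err_2 = mean_squared_error(weights, dequantize(levels_2, scale_2))
```

At 2 bits only the levels -1, 0, and 1 are available, so every weight except the largest collapses to zero and `err_2` is far larger than `err_8`. That is the kind of sharp degradation at aggressive bitwidths the abstract refers to, shown here on a toy vector rather than a trained model.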

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

arXiv:2606.09236v1 Announce Type: cross Abstract: Autonomous racing has seen remarkable progress through deep reinforcement learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity: the rider must manage balance and lean angle, steering and throttle control are more reactive, and the vehicle is lighter. In this work, we present a […]

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

arXiv:2606.09607v1 Announce Type: cross Abstract: Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads — but validating by causal ablation rather than […]

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

arXiv:2503.14229v4 Announce Type: replace Abstract: Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) HAPS 2.0 […]

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

arXiv:2602.21172v3 Announce Type: replace Abstract: Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NORD (No Reasoning for Driving). Compared to existing VLAs, NORD achieves […]

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

arXiv:2605.25985v2 Announce Type: replace Abstract: Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a […]

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

arXiv:2502.16584v2 Announce Type: replace-cross Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning […]

Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

arXiv:2511.11041v2 Announce Type: replace-cross Abstract: We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde{e} + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections, subtracting $\mu$ directly (R1) or projecting each embedding off the mean direction (R2), […]
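
The two corrections named in this abstract can be sketched in a few lines. This is one reading of the truncated text, not the authors' implementation: the function names are hypothetical, the mean is estimated from the sample at hand, and the final renormalization to unit length is an assumption suggested by the title.

```python
import math


def mean_vector(embeddings):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def normalize(v):
    """Scale v to unit length (assumed renormalization step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def correct_r1(e, mu):
    """R1: subtract the mean vector, then renormalize."""
    return normalize([x - m for x, m in zip(e, mu)])


def correct_r2(e, mu):
    """R2: remove the component of e along the mean direction, then renormalize."""
    mu_hat = normalize(mu)
    coeff = dot(e, mu_hat)
    return normalize([x - coeff * m for x, m in zip(e, mu_hat)])


# Two toy embeddings that share a large common component along the first axis.
embeddings = [[1.0, 0.1], [1.0, -0.1]]
mu = mean_vector(embeddings)

cos_before = dot(normalize(embeddings[0]), normalize(embeddings[1]))
cos_after_r1 = dot(correct_r1(embeddings[0], mu), correct_r1(embeddings[1], mu))
cos_after_r2 = dot(correct_r2(embeddings[0], mu), correct_r2(embeddings[1], mu))
```

In this toy case the shared component makes the raw vectors look nearly identical (cosine close to 1), while both corrections expose that they point in opposite directions once the mean is removed. R1 and R2 coincide here because each toy embedding has a unit coefficient along the mean direction; in general they differ.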

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

arXiv:2601.22859v3 Announce Type: replace-cross Abstract: The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation […]

Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

arXiv:2603.25184v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible […]

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK, registration number 16808844.