arXiv:2604.01993v1 Announce Type: cross Abstract: Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: […]
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
arXiv:2604.02091v1 Announce Type: cross Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human-annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often […]
Universal Hypernetworks for Arbitrary Models
arXiv:2604.02215v1 Announce Type: cross Abstract: Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the Universal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator […]
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
arXiv:2604.02322v1 Announce Type: cross Abstract: Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training […]
Cardiac-Phase-Dependent Spin Coherence as a Probe of Boundary Covariance Geometry in Neural Tissue
arXiv:2505.22680v2 Announce Type: replace Abstract: A recently proposed geometric framework predicts that the transition from distributed belief to committed action involves a metric regime change, culminating in a boundary regime where cross-mode structure becomes algebraically necessary for continued state-space compression. This paper examines whether reported magnetic resonance measurements of proton spins in neural tissue provide […]
Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
arXiv:2604.00528v2 Announce Type: replace-cross Abstract: 3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our […]
CogBias: Measuring and Mitigating Cognitive Bias in Large Language Models
arXiv:2604.01366v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making contexts. While prior work has shown that LLMs exhibit cognitive biases behaviorally, whether these biases correspond to identifiable internal representations and can be mitigated through targeted intervention remains an open question. We define LLM cognitive bias as systematic, reproducible deviations […]
Predicting LLM Output Length via Entropy-Guided Representations
arXiv:2602.11812v2 Announce Type: replace Abstract: The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic “one-to-many” sampling scenarios. We […]
RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
arXiv:2604.01375v1 Announce Type: new Abstract: Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose rubric quality issues from such aggregated or downstream signals alone. […]
LLM-as-a-Judge for Time Series Explanations
arXiv:2604.02118v1 Announce Type: new Abstract: Evaluating the factual correctness of LLM-generated natural-language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate […]
Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
arXiv:2604.01712v1 Announce Type: cross Abstract: The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the […]
The Geometric Anatomy of Capability Acquisition in Transformers
arXiv:2602.15997v4 Announce Type: replace-cross Abstract: Neural networks gain capabilities during training, but the internal changes that precede capability acquisition are not well understood. In particular, the relationship between geometric change and behavioral change, and the effect of task difficulty and model scale on that relationship, is unclear. We track geometric measures and linear probes across […]