arXiv:2603.23510v1 Announce Type: cross Abstract: As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT […]
DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models
arXiv:2603.23514v1 Announce Type: cross Abstract: Large Language Models appear competent when answering general questions but often fail when pushed into domain-specific details. No existing methodology provides an out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains. We present DepthCharge, a domain-agnostic framework that measures knowledge depth […]
Completeness of Unbounded Best-First Minimax and Descent Minimax
arXiv:2603.24572v1 Announce Type: new Abstract: In this article, we focus on search algorithms for two-player perfect information games, whose objective is to determine the best possible strategy, and ideally a winning strategy. Unfortunately, some search algorithms for games in the literature are not able to always determine a winning strategy, even with an infinite search […]
Evidence for Limited Metacognition in LLMs
arXiv:2509.21545v2 Announce Type: cross Abstract: The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on […]
A Metric for Three-Dimensional Color Discrimination Derived from V1 Population Fisher Information
arXiv:2603.24356v1 Announce Type: new Abstract: We derive a Riemannian metric on three-dimensional color space from the Fisher information of neural population codes in the visual pathway. Photoreceptor adaptation, retinal opponent channels, and cortical population encoding each map onto a geometric construction, producing a metric tensor whose components correspond to measurable neural quantities. The resulting 17-parameter […]
Tiny Inference-Time Scaling with Latent Verifiers
arXiv:2603.22492v2 Announce Type: replace-cross Abstract: Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines […]
PLDR-LLMs Reason At Self-Organized Criticality
arXiv:2603.23539v1 Announce Type: new Abstract: We show that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. The characteristics of PLDR-LLM deductive outputs at criticality is similar to second-order phase transitions. At criticality, the correlation length diverges, and the deductive outputs attain a metastable steady state. The steady state behaviour suggests that deductive outputs […]
Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs
arXiv:2603.23532v1 Announce Type: cross Abstract: This paper investigates whether structured representations can preserve the meaning of scientific sentences. To test this, a lightweight LLM is fine-tuned using a novel structural loss function to generate hierarchical JSON structures from sentences collected from scientific articles. These JSONs are then used by a generative model to reconstruct the […]
Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
arXiv:2603.24481v1 Announce Type: new Abstract: Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical […]
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
arXiv:2410.02064v3 Announce Type: cross Abstract: It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. […]
Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes
arXiv:2603.23507v1 Announce Type: cross Abstract: While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion […]
S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering
arXiv:2603.23512v1 Announce Type: cross Abstract: We present S-Path-RAG, a semantic-aware shortest-path Retrieval-Augmented Generation framework designed to improve multi-hop question answering over large knowledge graphs. S-Path-RAG departs from one-shot, text-heavy retrieval by enumerating bounded-length, semantically weighted candidate paths using a hybrid weighted $k$-shortest, beam, and constrained random-walk strategy, learning a differentiable path scorer together with a […]