Inside Interoception: The hidden sense of how you feel inside

MIT Technology Review Explains: Let our writers untangle the complex, messy world of science and technology to help you understand what’s coming next. You can read more

Why “reprogramming” is the buzziest approach to reversing aging right now

Earlier this week, Life Biosciences, a biotech company focused on reversing age-related diseases, announced that it had dosed its first volunteer. A person with glaucoma

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

arXiv:2606.11836v2 Announce Type: replace-cross Abstract: This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. More

Phase model analysis of the effect of M-current on neural synchrony in hippocampal networks

arXiv:2606.12684v1 Announce Type: new Abstract: Neural assemblies, transiently coordinated groups of neurons, observed in the hippocampus are thought to underlie the formation of episodic memories.

Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

arXiv:2606.13222v1 Announce Type: cross Abstract: Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

June 10, 2026

arXiv:2606.10254v1 Announce Type: new
Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error ($\sim$2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark “Evaluation Gap”: judges are considerably more accurate and consistent on synthetic text (MSE $\sim$1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a “structural collapse” into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.
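To make the reported metric concrete, here is a minimal sketch of how a Mean Squared Error between judge scores and expert grades could be computed, and how the difference between the two settings gives an "Evaluation Gap". The function name, the score scale, and all numbers below are illustrative assumptions, not data or code from the paper.

```python
from typing import Sequence


def mean_squared_error(judge: Sequence[float], expert: Sequence[float]) -> float:
    """Average squared difference between judge scores and expert grades."""
    if len(judge) != len(expert):
        raise ValueError("judge and expert score lists must have the same length")
    if not judge:
        raise ValueError("score lists must not be empty")
    return sum((j - e) ** 2 for j, e in zip(judge, expert)) / len(judge)


# Hypothetical scores, invented purely for illustration.
# Setting 1: an LLM judge grading real student responses.
real_expert = [5.0, 3.0, 4.0, 0.0]
real_judge = [3.0, 5.0, 4.0, 2.0]

# Setting 2 (control): the same judge grading synthetic LLM-written solutions.
synth_expert = [5.0, 3.0, 4.0, 0.0]
synth_judge = [5.0, 4.0, 4.0, 1.0]

mse_real = mean_squared_error(real_judge, real_expert)
mse_synth = mean_squared_error(synth_judge, synth_expert)

# A positive gap means the judge is less accurate on real student work.
evaluation_gap = mse_real - mse_synth

print(f"MSE on real responses:      {mse_real:.2f}")
print(f"MSE on synthetic solutions: {mse_synth:.2f}")
print(f"Evaluation gap:             {evaluation_gap:.2f}")
```

Because errors are squared, a few large disagreements with the expert grade dominate the average, so a higher MSE on real responses can reflect occasional severe misjudgments rather than uniformly worse grading.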


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844