arXiv:2602.13595v2 Announce Type: replace
Abstract: Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile ($E \propto \mathrm{bits}$). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a ‘quantization trap’ in which reducing precision from 16-bit to 8- or 4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead (the hidden latency cost of dequantization kernels), which becomes a dominant bottleneck in sequential reasoning chains, and to a sequential energy amortization failure. As a result, breaking of the scaling law is unavoidable in practice. We formalize a Critical Model Scale $N^*$ that predicts when the trap dissolves or deepens as a function of model size, batch size, and hardware configuration, validated across a 120$\times$ parameter range (0.6B–72B) on six GPU architectures. Our findings suggest that the industry’s “smaller-is-better” heuristic is mathematically counterproductive for complex reasoning tasks.
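
The abstract states the decomposition only qualitatively, so the sketch below is a toy illustration of the claimed mechanism, not the paper's actual model or code: it assumes per-step energy splits into bit-width-proportional compute plus a fixed dequantization (cast) overhead that is only amortized across a batch, and derives the resulting crossover scale analogous to $N^*$. The constants E_PER_PARAM_BIT and E_CAST_PER_STEP and all function names are hypothetical placeholders.

# Toy energy model (illustrative only, not the paper's formulation). Per reasoning
# step, energy splits into bit-width-proportional compute and a fixed
# dequantization/cast overhead paid whenever weights are stored below 16-bit.
# All constants are hypothetical placeholders.

E_PER_PARAM_BIT = 1e-12   # hypothetical joules per parameter-bit of compute
E_CAST_PER_STEP = 1e-2    # hypothetical joules per step spent in dequantization kernels

def step_energy(n_params: float, bits: int, batch_size: int = 1) -> float:
    """Energy (J) of one sequential reasoning step for the whole batch."""
    compute = n_params * bits * E_PER_PARAM_BIT * batch_size
    cast = E_CAST_PER_STEP if bits < 16 else 0.0  # casting only needed for quantized weights
    return compute + cast

def chain_energy_per_sample(n_params: float, bits: int, hops: int, batch_size: int = 1) -> float:
    """Per-sample energy of a multi-hop chain; the cast overhead amortizes over the batch."""
    return hops * step_energy(n_params, bits, batch_size) / batch_size

# Crossover scale: below this parameter count, 4-bit costs MORE net energy than
# 16-bit at batch size 1 (a toy analogue of the Critical Model Scale N*).
N_STAR = E_CAST_PER_STEP / ((16 - 4) * E_PER_PARAM_BIT)

if __name__ == "__main__":
    print(f"toy N* ~= {N_STAR / 1e9:.2f}B parameters")
    for n in (0.6e9, 7e9, 72e9):
        e16 = chain_energy_per_sample(n, bits=16, hops=32)
        e4 = chain_energy_per_sample(n, bits=4, hops=32)
        verdict = "quantization trap" if e4 > e16 else "quantization helps"
        print(f"N={n / 1e9:5.1f}B  16-bit={e16:.3f} J  4-bit={e4:.3f} J  -> {verdict}")

Under these toy constants the trap appears only below roughly 0.8B parameters at batch size 1, and it dissolves as either the model or the batch grows, because the fixed cast cost is paid in full at every hop when decoding sequentially but shrinks per sample as it is amortized; this is the qualitative behaviour the Critical Model Scale is meant to predict.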
A blueprint for using AI to strengthen democracy
Every few centuries, changes in how information moves reshape how societies govern themselves. The printing press spread vernacular literacy, helping give rise to the Reformation