arXiv:2604.04384v1 Announce Type: cross
Abstract: Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90% of its variance in 2–11 singular components. The learned interaction matrix $W_Q^\mathrm{T} W_K$ needs 38–75 components for the same threshold out of $d_h \in \{64, 128\}$. The spectral gap is $5$–$25\times$ in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not of the frame that analyzes it.
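The abstract's central measurement, effective rank at a 90% variance threshold, can be sketched in a few lines of NumPy. This is not the paper's code; it is a minimal illustration of one common way to count how many singular components capture a given fraction of a matrix's squared-singular-value mass, applied to a synthetic low-rank-plus-noise matrix standing in for $\tilde{E}$ or $W_Q^\mathrm{T} W_K$.

```python
import numpy as np

def effective_rank(M, threshold=0.90):
    """Smallest number of singular components whose squared singular
    values account for at least `threshold` of the total variance."""
    s = np.linalg.svd(M, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)   # cumulative variance fraction
    return int(np.searchsorted(energy, threshold) + 1)

# Synthetic example: a rank-3 matrix plus small isotropic noise.
rng = np.random.default_rng(0)
d_h = 64
low_rank = rng.normal(size=(d_h, 3)) @ rng.normal(size=(3, d_h))
M = low_rank + 0.01 * rng.normal(size=(d_h, d_h))
print(effective_rank(M))  # close to 3: a few components dominate
```

Under this definition, the paper's claim is that $\tilde{E}$ scores 2–11 while $W_Q^\mathrm{T} W_K$ scores 38–75 on matrices of the same head dimension.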
