arXiv:2604.04384v1 Announce Type: cross
Abstract: Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90% of its variance in 2–11 singular components. The learned interaction matrix $W_Q^{\mathrm{T}} W_K$ needs 38–75 components (out of $d_h \in \{64, 128\}$) to reach the same threshold, a $5$–$25\times$ gap in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not of the frame that analyzes it.
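The 90%-variance component counts quoted above come from a standard SVD criterion: the smallest number of singular components whose squared singular values capture 90% of the total. Below is a minimal sketch of that computation; the `components_for_variance` helper and the random matrices standing in for a trained head's $W_Q$ and $W_K$ are illustrative assumptions, not the paper's code.

```python
import numpy as np

def components_for_variance(M, threshold=0.90):
    """Smallest k such that the top-k squared singular values of M
    account for at least `threshold` of the total squared-singular-value mass."""
    s = np.linalg.svd(M, compute_uv=False)
    energy = s ** 2
    cum = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cum, threshold) + 1)

# Hypothetical head dimension; the models in the abstract use d_h in {64, 128}.
d_h = 64
rng = np.random.default_rng(0)

# Placeholder stand-ins for a single head's W_Q and W_K
# (in practice these would be sliced from a trained checkpoint).
W_Q = rng.standard_normal((d_h, d_h))
W_K = rng.standard_normal((d_h, d_h))

k_interaction = components_for_variance(W_Q.T @ W_K)
print(f"components to reach 90% variance: {k_interaction} of {d_h}")
```

The same count applied to the logit energy field $\tilde{E}$ (a paper-specific object not reproduced here) versus $W_Q^{\mathrm{T}} W_K$ is what yields the reported $5$–$25\times$ gap in effective rank.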


