
  • The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry

arXiv:2604.22778v1 Announce Type: cross
Abstract: We present the first systematic study of weight-matrix singular value spectra during transformer pretraining, tracking the full SVD of every weight matrix at 25-step intervals across three model scales (30M–285M parameters). We discover three phenomena: (1) Transient Compression Waves: stable-rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early and then reverses, with late layers eventually over-compressing past early layers. (2) Persistent Spectral Gradients: the power-law exponent $\alpha$ develops a permanent depth gradient, forming a non-monotonic inverted-U in deeper models whose peak shifts toward earlier layers as depth increases. (3) Q/K–V Functional Asymmetry: value/output projections compress uniformly, while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that rank and spectral shape encode fundamentally different information about training. We formalize this as a two-timescale dynamical model and derive scaling laws ($\Delta\alpha \propto L^{0.26}$, $R^2 = 0.99$). We validate on nine models across three families (custom, GPT-2, Pythia; 30M–1B parameters; 8–36 layers), demonstrate that $\alpha$ predicts layer importance ($\rho = 0.69$–$0.84$, $p < 0.02$), and show that spectral-guided pruning outperforms Last-N heuristics by $1.1\times$–$3.6\times$ across seven models in two families (GPT-2 124M–774M, Pythia 160M–1B), with worst-vs-best gaps up to $23.7\times$, confirming the causal role of spectral structure.
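The two spectral quantities the abstract contrasts, stable rank and the power-law exponent $\alpha$ of the singular-value spectrum, can be sketched in a few lines of NumPy. This is an illustrative computation only: the choice of fitting the leading half of the ranked spectrum with an ordinary least-squares log-log fit is an assumption here, not the estimator used in the paper.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2, i.e. sum of squared singular
    values over the largest squared singular value."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

def powerlaw_alpha(W, fit_frac=0.5):
    """Estimate the exponent alpha of s_k ~ k^(-alpha) by a log-log
    least-squares fit over the leading fit_frac of the ranked spectrum.
    (Illustrative choice of fit range and estimator.)"""
    s = np.linalg.svd(W, compute_uv=False)
    k = np.arange(1, len(s) + 1)
    head = slice(0, max(2, int(len(s) * fit_frac)))
    slope, _ = np.polyfit(np.log(k[head]), np.log(s[head]), 1)
    return float(-slope)  # decaying spectrum => negative slope => positive alpha
```

Tracking these two numbers per layer over training checkpoints is enough to reproduce the kind of depth profiles the abstract describes: stable rank captures how concentrated the spectrum is, while $\alpha$ captures its shape.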


Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844