arXiv:2604.22778v1 Announce Type: cross
Abstract: We present the first systematic study of weight matrix singular value spectra \emph{during} transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M–285M parameters). We discover three phenomena: \textbf{(1)~Transient Compression Waves}: stable rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early and then \emph{reverses}: late layers eventually over-compress past early layers. \textbf{(2)~Persistent Spectral Gradients}: the power-law exponent~$\alpha$ develops a permanent depth gradient, forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. \textbf{(3)~Q/K–V Functional Asymmetry}: value/output projections compress uniformly while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that \emph{rank and spectral shape encode fundamentally different information about training}. We formalize this as a two-timescale dynamical model and derive scaling laws ($\Delta\alpha \propto L^{0.26}$, $R^2=0.99$). We validate on nine models across three families (custom, GPT-2, Pythia; 30M–1B parameters; 8–36 layers), demonstrate that $\alpha$ predicts layer importance ($\rho=0.69$–$0.84$, $p<0.02$), and show that spectral-guided pruning outperforms Last-N heuristics by $1.1\times$–$3.6\times$ across seven models in two families (GPT-2 124M–774M, Pythia 160M–1B), with worst-vs-best gaps of up to $23.7\times$ confirming the causal role of spectral structure.
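As a concrete illustration of the two spectral quantities the abstract tracks, the sketch below computes the stable rank and a power-law exponent $\alpha$ for a single weight matrix. The abstract does not specify the estimator used, so the definitions here (stable rank as squared Frobenius norm over squared spectral norm, and $\alpha$ fitted to the tail of the squared singular values with the standard maximum-likelihood/Hill estimator) are illustrative assumptions, as is the NumPy implementation and the `tail_fraction` parameter.

```python
import numpy as np

def spectral_diagnostics(W: np.ndarray, tail_fraction: float = 0.5):
    """Compute two spectral summaries of a weight matrix W.

    - stable rank: ||W||_F^2 / ||W||_2^2, a smooth proxy for matrix rank.
    - alpha: power-law exponent fitted to the tail of the squared singular
      values (eigenvalues of W^T W) via the maximum-likelihood (Hill) estimator.

    The exact estimator used in the paper is not given in the abstract; the
    Hill estimator over the largest `tail_fraction` of eigenvalues is one
    common choice and serves only as an illustrative stand-in.
    """
    s = np.linalg.svd(W, compute_uv=False)        # singular values, descending
    stable_rank = float(np.sum(s**2) / s[0]**2)   # ||W||_F^2 / ||W||_2^2

    lam = np.sort(s**2)[::-1]                     # eigenvalues of W^T W, descending
    k = max(int(tail_fraction * len(lam)), 2)     # number of tail eigenvalues to fit
    tail = lam[:k]
    lam_min = tail[-1]                            # tail threshold
    # MLE for a power law p(x) ~ x^{-alpha} above lam_min (Hill estimator)
    alpha = 1.0 + k / np.sum(np.log(tail / lam_min))
    return stable_rank, alpha

# Example: diagnostics for one checkpointed matrix (random stand-in here).
rng = np.random.default_rng(0)
W = rng.normal(size=(768, 768)) / np.sqrt(768)
print(spectral_diagnostics(W))
```

In a pretraining run, these two numbers would be logged for every weight matrix at each saved checkpoint; plotting them against layer index and training step is what surfaces the compression waves and depth gradients described above.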
Disclosure in the era of generative artificial intelligence
Generative artificial intelligence (AI) has rapidly become embedded in academic writing, assisting with tasks ranging from language editing to drafting text and producing evidence. Despite



