Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

arXiv:2602.23696v3 Announce Type: replace-cross
Abstract: We analyze cumulative parameter trajectories of transformer training under AdamW and identify a dominant low-dimensional drift direction (“backbone”) that captures 60–80% of long-horizon displacement from initialization. This direction is highly stable over rolling training windows yet reorients gradually across phases, particularly following objective reweighting. Per-batch gradients exhibit near-noise-floor alignment with the backbone, whereas optimizer-integrated updates align strongly with it, indicating that the structure emerges from accumulated optimizer dynamics rather than instantaneous gradient geometry.
Replacing AdamW with SGD-family optimizers eliminates this structure, and reducing $\beta_2$ smoothly degrades backbone dominance and reheating recoverability. Reheating experiments show that transverse probe modes can be transiently re-excited without substantially altering accumulated backbone drift.
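For reference, the role of $\beta_2$ in the update rule can be seen in a minimal single-parameter AdamW step. The hyperparameter values below are common defaults, not the paper's settings; lowering `beta2` shortens the second-moment memory, which the abstract links to weaker backbone dominance.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (Loshchilov & Hutter style, decoupled weight decay).
    beta2 sets the EMA horizon of the squared-gradient estimate v."""
    m = beta1 * m + (1 - beta1) * g        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g    # second moment: EMA of squared grads
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

# Minimize f(p) = p^2 from p = 1.0.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    g = 2.0 * p                            # gradient of p^2
    p, m, v = adamw_step(p, g, m, v, t, lr=0.05)
print(p)
```

Because the normalized step $m_t / \sqrt{v_t}$ integrates gradient history over a horizon of roughly $1/(1-\beta_2)$ steps, accumulated updates can develop low-dimensional structure that no single per-batch gradient exhibits.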
These results provide a trajectory-level characterization of optimizer-induced geometric structure in transformer training and shift attention from instantaneous gradient properties to cumulative update dynamics.


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK; registration number 16808844.