arXiv:2603.13331v2 Announce Type: replace
Abstract: Grokking, the sudden generalisation that appears long after a model has perfectly memorised its training data, has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta\left(\gamma_{\mathrm{eff}}^{-1} \log\left(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2\right)\right)$, where $\gamma_{\mathrm{eff}}$ is the optimiser's effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
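To make the scaling in the delay law concrete, the following is a minimal Python sketch, not the paper's algorithm. The abstract does not list the three inputs of its predictor, so this sketch assumes they are the effective contraction rate and the two squared parameter norms. The function name `predicted_grok_delay` and the constant `c` are hypothetical: the law is stated only up to a $\Theta(\cdot)$ constant, so `c` would have to be fitted from data, and $\gamma_{\mathrm{eff}} = \eta\lambda$ is used here even though for AdamW the abstract gives it only as a lower bound on the contraction rate.

```python
import math


def predicted_grok_delay(lr, weight_decay, norm_mem_sq, norm_post_sq, c=1.0):
    """Estimate T_grok - T_mem (in optimiser steps) from the delay law.

    Assumptions (not from the source): gamma_eff = lr * weight_decay,
    and c is a hypothetical fitted constant standing in for the Theta(.)
    prefactor, which the abstract does not specify.
    """
    if lr <= 0 or weight_decay <= 0:
        raise ValueError("lr and weight_decay must be positive")
    if norm_mem_sq <= 0 or norm_post_sq <= 0:
        raise ValueError("squared norms must be positive")

    gamma_eff = lr * weight_decay  # effective contraction rate (SGD case)
    # Delay grows with the log of the squared-norm ratio and shrinks as 1/gamma_eff.
    return c * math.log(norm_mem_sq / norm_post_sq) / gamma_eff


# Illustrative values only: a tenfold squared-norm ratio at lr=1e-3, wd=1.0.
base = predicted_grok_delay(1e-3, 1.0, 100.0, 10.0)
# Doubling the weight decay halves the predicted delay (inverse scaling).
halved = predicted_grok_delay(1e-3, 2.0, 100.0, 10.0)
```

With `c` left at 1, the absolute number is not meaningful; only the ratios between predictions reflect the inverse scaling with weight decay and learning rate and the logarithmic dependence on the norm ratio.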
Digital health tools and point solutions—pitfalls in population health program measurement
Digital health tools are generally poorly regulated and often lack strong research evidence, posing challenges for purchasers of point solutions such as employer groups and