The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

May 5, 2026

arXiv:2603.13331v2 Announce Type: replace
Abstract: Grokking, the sudden generalisation that appears long after a model has perfectly memorised its training data, has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta\left(\gamma_{\mathrm{eff}}^{-1} \log\left(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2\right)\right)$, where $\gamma_{\mathrm{eff}}$ is the optimiser’s effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
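
To make the shape of the delay law concrete, here is a minimal Python sketch. It is not the paper's three-input prediction algorithm, which the abstract does not spell out. It assumes an idealised toy setting in which the weights shrink by pure weight decay alone, $\theta \leftarrow (1 - \eta\lambda)\theta$, with the gradient term ignored. The function names and the constant `c` (standing in for the unspecified constant hidden by $\Theta$) are hypothetical. In that toy setting the number of steps needed to contract from $\|\theta_{\mathrm{mem}}\|^2$ to $\|\theta_{\mathrm{post}}\|^2$ is approximately $\log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2) / (2\eta\lambda)$, which corresponds to `c = 0.5`.

```python
import math


def predicted_delay(eta, lam, norm_mem_sq, norm_post_sq, c=1.0):
    """Delay estimate of the form c * log(norm ratio) / gamma_eff.

    Uses gamma_eff = eta * lam (the SGD case stated in the abstract).
    `c` is a hypothetical stand-in for the constant hidden by Theta(.).
    """
    gamma_eff = eta * lam
    return c * math.log(norm_mem_sq / norm_post_sq) / gamma_eff


def steps_under_pure_decay(eta, lam, norm_mem_sq, norm_post_sq):
    """Count steps of theta <- (1 - eta*lam) * theta until the squared
    norm falls to norm_post_sq. Toy model: the gradient term is ignored."""
    shrink = (1.0 - eta * lam) ** 2  # per-step factor on the squared norm
    norm_sq = norm_mem_sq
    steps = 0
    while norm_sq > norm_post_sq:
        norm_sq *= shrink
        steps += 1
    return steps


if __name__ == "__main__":
    eta, lam = 1e-3, 1.0  # illustrative values, not taken from the paper
    simulated = steps_under_pure_decay(eta, lam, 100.0, 1.0)
    estimate = predicted_delay(eta, lam, 100.0, 1.0, c=0.5)
    print(f"simulated steps: {simulated}, closed-form estimate: {estimate:.1f}")
```

In this sketch, doubling either the learning rate or the weight decay halves the estimate, and the estimate grows only logarithmically with the norm ratio, which matches the three scaling predictions listed in the abstract. Real training runs include gradient updates, so the toy count should be read as an illustration of the scaling, not as a usable delay predictor.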

