arXiv:2505.18244v3 Announce Type: replace-cross
Abstract: Why do language models from different architecture families respond so differently to the same perturbation? We argue that the answer is not scale, but how architecture shapes information compression. Analyzing eight Transformer models (7B–70B parameters) from the Llama and Qwen families, we show that every model spontaneously develops discrete functional boundaries dividing its layers into Local, Intermediate, and Global processing segments, yet boundary locations and per-segment brittleness are determined overwhelmingly by architecture family rather than model size or training configuration. We formalize this regularity as the Multi-Scale Probabilistic Generation Theory (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives a tiered set of falsifiable predictions. Three predictions are strongly confirmed: all eight models exhibit two prominent phase-transition boundaries (P1.1); Llama boundary positions are stable across a $10\times$ parameter range ($\mathrm{CV}=0.067$–$0.095$) while Qwen positions vary widely ($\mathrm{CV}=0.465$–$0.726$), precisely matching our strong- and weak-dominance conditions; and cross-architecture local-segment brittleness spans three orders of magnitude ($493\times$ ratio), a gap that architecture family alone predicts and that dwarfs any within-family or scale-driven variation.
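The stability comparison in the abstract rests on the coefficient of variation (CV), the standard deviation of boundary positions divided by their mean. A minimal sketch of that statistic follows. It assumes the population standard deviation and boundary positions normalized by model depth; the abstract specifies neither. The function name and the sample values are hypothetical illustrations, not data from the paper.

```python
import statistics


def coefficient_of_variation(values):
    """Return population standard deviation divided by the mean.

    Assumption: the population form (pstdev) is used here; the abstract
    does not say whether the sample or population estimator was applied.
    """
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean


# Hypothetical normalized boundary depths (boundary layer / total layers)
# for models of different sizes within one family. Illustrative only.
stable_family = [0.30, 0.32, 0.28, 0.31]
scattered_family = [0.10, 0.45, 0.25, 0.60]

print(f"stable family CV:    {coefficient_of_variation(stable_family):.3f}")
print(f"scattered family CV: {coefficient_of_variation(scattered_family):.3f}")
```

A low CV means the boundary sits at roughly the same relative depth regardless of model size, which is the pattern the abstract reports for Llama. A high CV means the boundary moves with scale, the pattern reported for Qwen.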
Digital health tools and point solutions—pitfalls in population health program measurement
Digital health tools are generally poorly regulated and often lack strong research evidence, posing challenges for purchasers of point solutions such as employer groups and