Emergent Hierarchical Structure in Large Language Models: An Information-Theoretic Framework for Multi-Scale Representation

arXiv:2505.18244v3 Announce Type: replace-cross
Abstract: Why do language models from different architecture families respond so differently to the same perturbation? We argue that the answer is not scale, but how architecture shapes information compression. Analyzing eight Transformer models (7B–70B parameters) from the Llama and Qwen families, we show that every model spontaneously develops discrete functional boundaries dividing its layers into Local, Intermediate, and Global processing segments, yet boundary locations and per-segment brittleness are determined overwhelmingly by architecture family rather than model size or training configuration. We formalize this regularity as the Multi-Scale Probabilistic Generation Theory (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives a tiered set of falsifiable predictions. Three predictions are strongly confirmed: all eight models exhibit two prominent phase-transition boundaries (P1.1); Llama boundary positions are stable across a $10\times$ parameter range ($\mathrm{CV}=0.067$–$0.095$) while Qwen positions vary widely ($\mathrm{CV}=0.465$–$0.726$), precisely matching our strong- and weak-dominance conditions; and cross-architecture local-segment brittleness spans three orders of magnitude ($493\times$ ratio), a gap that architecture family alone predicts and that dwarfs any within-family or scale-driven variation.
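
The CV figures quoted in the abstract are coefficients of variation, that is, the standard deviation of a set of values divided by their mean. The sketch below shows how such a stability score could be computed for boundary positions across models in one family. It rests on several assumptions not stated in the abstract: that boundary positions are expressed as relative depth (boundary layer divided by total layers), that the population standard deviation is used, and the helper name `coefficient_of_variation`. The input numbers are invented for illustration and are not the paper's measurements.

```python
import statistics


def coefficient_of_variation(values):
    """Return population standard deviation divided by the mean.

    Hypothetical helper: the abstract does not say whether the authors
    use the population or the sample standard deviation.
    """
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean


# Invented relative boundary depths (boundary layer / total layers),
# one value per model size. These are NOT results from the paper.
stable_family = [0.30, 0.32, 0.28, 0.31]
variable_family = [0.10, 0.45, 0.25, 0.60]

cv_stable = coefficient_of_variation(stable_family)
cv_variable = coefficient_of_variation(variable_family)

print(f"stable family CV:   {cv_stable:.3f}")
print(f"variable family CV: {cv_variable:.3f}")
```

A low CV means the boundary sits at roughly the same relative depth regardless of model size, which is the pattern the abstract reports for Llama; a high CV means the position shifts substantially from one model to the next, as reported for Qwen.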
