• Home
  • Uncategorized
  • Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts

arXiv:2507.15465v3 Announce Type: replace-cross
Abstract: Computational workloads composing traditional transformer models are starkly bifurcated. Multi-Head Attention (MHA) and Grouped-Query Attention are memory-bound due to low arithmetic intensity, while FeedForward Networks are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the attention bottleneck.
This paper argues that recent architectural advances in transformer models — Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) — introduce new dominant bottlenecks, shifting the challenge away from memory-intensive attention. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude higher than that of MHA, moving it toward a compute-bound regime well-matched to modern accelerators such as GPUs. Second, distributing MoE experts across a pool of accelerators allows batching to tune their arithmetic intensity to that of dense layers, producing a more balanced computational profile. Consequently, the focus of hardware and system optimization should shift from attention acceleration to high-bandwidth interconnects and balancing expert workloads across accelerators.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registeration number 16808844