arXiv:2603.04427v3 Announce Type: replace-cross
Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns), far fewer than value transfer needs.
We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch, unlike GQA and MLA, which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated SVD (where $r = d_{\text{select}}$), set $W_K' = A$ as the new key projection producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q' = W_Q B^\top$) at zero cost, since queries are never cached. At 7B scale, training from scratch with $r = d_{\text{model}}/4$ matches full-attention perplexity (9.2 vs 9.3 PPL after 20B tokens) while using 12% fewer parameters and training 8% faster. For existing models, SVD + QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at approximately 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving 128K context, factored keys save 25 GB of KV cache per user, enabling approximately 60% more concurrent users on identical hardware.
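The factorization step described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: the dimensions, the row-vector projection convention (`proj = x @ W`), and all variable names besides those in the abstract ($W_K$, $W_Q$, $A$, $B$, $r$) are assumptions. The key identity it demonstrates is that scoring compact $r$-dimensional keys $x_k A$ against absorbed queries $x_q W_Q B^\top$ gives exactly the logits of the rank-$r$ approximation $W_K \approx AB$.

```python
import numpy as np

# Assumed toy dimensions; the abstract uses r = d_model / 4.
d_model = 64
r = d_model // 4

rng = np.random.default_rng(0)
W_K = rng.standard_normal((d_model, d_model))  # original key projection
W_Q = rng.standard_normal((d_model, d_model))  # original query projection

# Truncated SVD: W_K ~ A @ B, with A (d x r) and B (r x d).
U, S, Vt = np.linalg.svd(W_K, full_matrices=False)
A = U[:, :r] * S[:r]          # new key projection W_K' = A
B = Vt[:r, :]

W_Q_new = W_Q @ B.T           # absorb B^T into the query projection

# One query token and one key token (row-vector convention).
x_q = rng.standard_normal(d_model)
x_k = rng.standard_normal(d_model)

k_compact = x_k @ A           # r-dimensional key: the only thing cached
q_new = x_q @ W_Q_new         # r-dimensional query: never cached
logit_factored = q_new @ k_compact

# Identical (up to float error) to attending with the rank-r W_K:
logit_lowrank = (x_q @ W_Q) @ (x_k @ (A @ B))
assert np.allclose(logit_factored, logit_lowrank)
```

The query-side absorption is free precisely because only keys live in the cache: $W_Q'$ has the same parameter count pattern per token as before, while each cached key shrinks from $d$ to $r$ floats.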
Generative AI-assisted Participatory Modeling in Socio-Environmental Planning under Deep Uncertainty
arXiv:2603.17021v1 Announce Type: new
Abstract: Socio-environmental planning under deep uncertainty requires researchers to identify and conceptualize problems before exploring policies and deploying plans. In practice

