arXiv:2603.04427v3 Announce Type: replace-cross
Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns), far fewer than value transfer needs.
We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch, unlike GQA and MLA, which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated SVD (where $r = d_{\text{select}}$), set $W_K' = A$ as the new key projection producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q' = W_Q B^\top$) at zero cost, since queries are never cached. At 7B scale, training from scratch with $r = d_{\text{model}}/4$ matches full-attention perplexity (9.2 vs 9.3 PPL after 20B tokens) while using 12% fewer parameters and training 8% faster. For existing models, SVD + QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at approximately 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving 128K context, factored keys save 25 GB of KV cache per user, enabling approximately 60% more concurrent users on identical hardware.
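The factorization step described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: the dimensions, the row-vector projection convention (`proj = x @ W`), and all variable names besides those in the abstract ($W_K$, $W_Q$, $A$, $B$, $r$) are assumptions. The key identity it demonstrates is that scoring compact $r$-dimensional keys $x_k A$ against absorbed queries $x_q W_Q B^\top$ gives exactly the logits of the rank-$r$ approximation $W_K \approx AB$.

```python
import numpy as np

# Assumed toy dimensions; the abstract uses r = d_model / 4.
d_model = 64
r = d_model // 4

rng = np.random.default_rng(0)
W_K = rng.standard_normal((d_model, d_model))  # original key projection
W_Q = rng.standard_normal((d_model, d_model))  # original query projection

# Truncated SVD: W_K ~ A @ B, with A (d x r) and B (r x d).
U, S, Vt = np.linalg.svd(W_K, full_matrices=False)
A = U[:, :r] * S[:r]          # new key projection W_K' = A
B = Vt[:r, :]

W_Q_new = W_Q @ B.T           # absorb B^T into the query projection

# One query token and one key token (row-vector convention).
x_q = rng.standard_normal(d_model)
x_k = rng.standard_normal(d_model)

k_compact = x_k @ A           # r-dimensional key: the only thing cached
q_new = x_q @ W_Q_new         # r-dimensional query: never cached
logit_factored = q_new @ k_compact

# Identical (up to float error) to attending with the rank-r W_K:
logit_lowrank = (x_q @ W_Q) @ (x_k @ (A @ B))
assert np.allclose(logit_factored, logit_lowrank)
```

The query-side absorption is free precisely because only keys live in the cache: $W_Q'$ has the same parameter count pattern per token as before, while each cached key shrinks from $d$ to $r$ floats.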
Generative AI-assisted Participatory Modeling in Socio-Environmental Planning under Deep Uncertainty
arXiv:2603.17021v1 Announce Type: new
Abstract: Socio-environmental planning under deep uncertainty requires researchers to identify and conceptualize problems before exploring policies and deploying plans. In practice

