arXiv:2602.08585v2 Announce Type: replace-cross
Abstract: Given the quadratic complexity of attention, KV cache eviction is vital for accelerating model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. This assumption overlooks the heterogeneity in predictive fidelity across attention heads: some heads prioritize the instantaneous contribution of tokens, while others are dedicated to capturing long-horizon utility. In this paper, we argue that optimal budget allocation should be governed by the marginal utility of preserving long-term semantic information. Building on this insight, we propose LU-KV, a novel framework that formulates head-level budget allocation as a global combinatorial optimization problem whose objective is to maximize the long-horizon marginal contribution of the retained tokens. To solve this non-convex problem, we employ a convex-hull relaxation and a marginal-utility-based greedy solver, achieving near-optimal solutions. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Evaluations on the LongBench and RULER benchmarks demonstrate that LU-KV reduces KV cache size by 80% with minimal performance degradation, while also decreasing inference latency and GPU memory footprint.
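The convex-hull relaxation and marginal-utility greedy step mentioned in the abstract can be illustrated with a generic sketch. This is not LU-KV's implementation: the abstract does not specify the utility definition, the profiling protocol, or the solver details. The sketch assumes a hypothetical input `utility_curves`, where `utility_curves[h][b]` is the profiled utility of head `h` when it keeps `b` budget units, and the function names `concave_envelope` and `greedy_allocate` are invented for illustration.

```python
import heapq


def concave_envelope(values):
    """Upper concave envelope of the points (b, values[b]), sampled at each integer b."""
    hull = []  # indices of the envelope's vertices, left to right
    for i, v in enumerate(values):
        # Drop the last vertex while it lies on or below the chord to the new point.
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            if (values[b] - values[a]) * (i - b) <= (v - values[b]) * (b - a):
                hull.pop()
            else:
                break
        hull.append(i)
    env = []
    for a, b in zip(hull, hull[1:]):
        for i in range(a, b):
            t = (i - a) / (b - a)  # linear interpolation between hull vertices
            env.append(values[a] + t * (values[b] - values[a]))
    env.append(values[hull[-1]])
    return env


def greedy_allocate(utility_curves, total_budget):
    """Hand out budget units one at a time to the head with the largest marginal gain.

    Gains are read from each head's concave envelope (the relaxed curve), so they are
    non-increasing per head and the greedy choice is optimal for the relaxed problem.
    """
    envelopes = [concave_envelope(curve) for curve in utility_curves]
    alloc = [0] * len(envelopes)
    heap = []  # max-heap via negated gains: (-gain, head index)
    for h, env in enumerate(envelopes):
        if len(env) > 1:
            heapq.heappush(heap, (-(env[1] - env[0]), h))
    remaining = total_budget
    while remaining > 0 and heap:
        _, h = heapq.heappop(heap)
        alloc[h] += 1
        remaining -= 1
        env = envelopes[h]
        if alloc[h] + 1 < len(env):
            heapq.heappush(heap, (-(env[alloc[h] + 1] - env[alloc[h]]), h))
    return alloc


# Toy curves for two heads (made-up numbers): head 0 pays off only once it holds
# two units, head 1 gains most from its first unit.
curves = [[0, 1, 5, 6], [0, 3, 4, 4.5]]
allocation = greedy_allocate(curves, total_budget=3)
```

In this toy case the allocation is two units for head 0 and one for head 1. Because the greedy step works on the relaxed curves, it can stop partway along an envelope segment, where the true utility is lower than the envelope value; that gap is one reason such a scheme is near-optimal rather than exact.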
Digital health tools and point solutions—pitfalls in population health program measurement
Digital health tools are generally poorly regulated and often lack strong research evidence, posing challenges for purchasers of point solutions such as employer groups and