arXiv:2508.06526v3 Announce Type: replace-cross
Abstract: As large language models grow in both size and context length, the memory and communication costs of key-value (KV) cache storage have become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead.
We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV Routing to reduce token-to-KV access, and PiKV Scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV Compression modules into the caching pipeline for acceleration.
PiKV is publicly available as an open-source software library: https://github.com/NoakLiu/PiKV. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
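To make the idea of expert-sharded KV storage with score-based retention concrete, here is a minimal toy sketch in plain Python. It is not PiKV's API: the class name ShardedKVCache, the modulo shard assignment, and the "evict the lowest-scoring entry" policy are all assumptions chosen for illustration, and the shards are ordinary dictionaries standing in for per-GPU memory.

```python
from collections import defaultdict


class ShardedKVCache:
    """Toy expert-sharded KV cache (hypothetical, not the PiKV implementation).

    Each shard is a dict standing in for the cache held on one GPU.
    """

    def __init__(self, num_shards, capacity_per_shard):
        self.num_shards = num_shards
        self.capacity = capacity_per_shard
        # shard index -> {(expert_id, token_pos): (key, value, score)}
        self.shards = defaultdict(dict)

    def shard_of(self, expert_id):
        # Assumed placement rule: experts are spread round-robin over shards.
        return expert_id % self.num_shards

    def put(self, expert_id, token_pos, key, value, score):
        """Store one KV entry on its expert's shard, evicting if over capacity."""
        shard = self.shards[self.shard_of(expert_id)]
        shard[(expert_id, token_pos)] = (key, value, score)
        if len(shard) > self.capacity:
            # Assumed retention rule: drop the entry with the lowest score.
            victim = min(shard, key=lambda entry: shard[entry][2])
            del shard[victim]

    def get(self, expert_id):
        """Return {token_pos: (key, value)} for one expert, reading only its shard."""
        shard = self.shards[self.shard_of(expert_id)]
        return {
            pos: (k, v)
            for (e, pos), (k, v, _score) in shard.items()
            if e == expert_id
        }


cache = ShardedKVCache(num_shards=2, capacity_per_shard=2)
cache.put(expert_id=0, token_pos=0, key="k0", value="v0", score=0.9)
cache.put(expert_id=2, token_pos=1, key="k1", value="v1", score=0.1)
cache.put(expert_id=0, token_pos=2, key="k2", value="v2", score=0.5)
cache.put(expert_id=1, token_pos=3, key="k3", value="v3", score=0.7)
```

In this sketch a lookup for one expert touches only that expert's shard, which is the property that avoids a globally synchronized cache; the score passed to put is a placeholder for whatever relevance signal a real scheduler would compute.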
Explainable AI in kidney stone detection and segmentation: a mini review
Kidney stones are one of the most common renal disorders that can produce severe complications if not diagnosed and treated early. Recently, advances in AI