2601.18999v1 Jan 26, 2026 cs.LG

Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

Qiuyi Zhang

Sandeep Silwal

Fangzhou Wu

Abstract

KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to 6.92× in cache hit rate, 11.96× reduction in latency, 14.06× reduction in time-to-first-token (TTFT), and 77.4% increase in throughput over the state-of-the-art methods. Our code is available at https://github.com/fzwark/KVRouting.
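
As a rough illustration of why randomization can help eviction, the sketch below compares LRU with the classic randomized marking algorithm from paging theory on a cyclic request pattern. This is not the paper's algorithm, and the function names (`lru_hits`, `marking_hits`), the integer keys standing in for cached prefixes, and the toy workload are all assumptions made for this example. The workload cycles through one more key than the cache can hold, which is a well-known worst case for LRU.

```python
import random
from collections import OrderedDict


def lru_hits(requests, capacity):
    """Count cache hits under Least Recently Used eviction."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits


def marking_hits(requests, capacity, rng):
    """Count cache hits under the classic randomized marking algorithm."""
    cache = set()
    marked = set()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                if len(marked) == len(cache):
                    marked.clear()  # every entry is marked: start a new phase
                # Evict a uniformly random unmarked entry.
                victim = rng.choice(sorted(cache - marked))
                cache.remove(victim)
            cache.add(key)
        marked.add(key)  # a requested key is protected until the next phase
    return hits


# Hypothetical workload: cycle through capacity + 1 distinct keys.
capacity = 4
requests = [i % (capacity + 1) for i in range(1000)]

lru = lru_hits(requests, capacity)
rnd = marking_hits(requests, capacity, random.Random(0))
print(f"LRU hits: {lru}, randomized marking hits: {rnd}")
```

On this workload LRU always evicts exactly the key that is requested next, so it never scores a hit, while the randomized policy keeps some of the keys that are about to be reused. Real KV caches evict prefix blocks rather than integer keys, so this should be read only as intuition for the eviction side of the trade-off.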
