2602.08585v1 Feb 09, 2026 cs.LG

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Ziyao Tang (Citations: 1, h-index: 1)

Pengkun Jiao (Citations: 48, h-index: 4)

Xinhang Chen (Citations: 6, h-index: 1)

Wei Liu (Citations: 0, h-index: 0)

Shiyong Li (Citations: 3, h-index: 1)

Jingjing Chen (Citations: 3,360, h-index: 7)

Abstract

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Based on this insight, we propose LU-KV, a novel framework that optimizes head-level budget allocation through a convex-hull relaxation and a marginal-utility-based greedy solver to achieve near-optimal precision. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Extensive evaluations on LongBench and RULER benchmarks demonstrate that LU-KV achieves an 80% reduction in KV cache size with minimal performance degradation, while simultaneously reducing inference latency and GPU memory footprint.
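The abstract names a convex-hull relaxation and a marginal-utility-based greedy solver but does not spell out either step. The sketch below is one generic way such a head-level allocation could be set up, not the LU-KV implementation. It assumes each head comes with a small table of (budget, utility) points with distinct budgets, for example from offline profiling. The function names, the curves, and all numbers are hypothetical and chosen only for illustration.

```python
import heapq


def upper_concave_hull(points):
    """Vertices of the upper concave envelope of (budget, utility) points.

    Assumes budgets are distinct. The envelope has decreasing slopes,
    which is what makes a greedy pass over its segments well-behaved.
    """
    hull = []
    for x, y in sorted(points):
        # Drop the last vertex while it lies on or below the chord
        # from hull[-2] to the new point (x, y).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (x - x1) <= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull


def allocate_budgets(curves, total_budget):
    """Greedy head-level allocation by marginal utility per cached token.

    curves: dict mapping head id -> list of (budget, utility) points.
    Returns a dict mapping head id -> allocated budget (in tokens).
    """
    hulls = {h: upper_concave_hull(c) for h, c in curves.items()}
    alloc = {h: hull[0][0] for h, hull in hulls.items()}
    remaining = total_budget - sum(alloc.values())
    if remaining < 0:
        raise ValueError("total_budget is below the minimum budgets")

    def slope(h, i):
        (x0, y0), (x1, y1) = hulls[h][i], hulls[h][i + 1]
        return (y1 - y0) / (x1 - x0)

    # Max-heap (via negated slopes) of each head's next hull segment.
    heap = [(-slope(h, 0), h, 0) for h in hulls if len(hulls[h]) > 1]
    heapq.heapify(heap)

    while heap and remaining > 0:
        neg_slope, h, i = heapq.heappop(heap)
        if -neg_slope <= 0:
            break  # no remaining segment adds utility
        width = hulls[h][i + 1][0] - hulls[h][i][0]
        step = min(width, remaining)
        alloc[h] += step
        remaining -= step
        # Only expose the head's next segment once this one is fully used.
        if step == width and i + 2 < len(hulls[h]):
            heapq.heappush(heap, (-slope(h, i + 1), h, i + 1))
    return alloc


# Hypothetical profiled curves: h0 saturates early, h1 keeps gaining.
curves = {
    "h0": [(0, 0.0), (64, 0.5), (128, 0.6), (256, 0.65)],
    "h1": [(0, 0.0), (64, 0.1), (128, 0.5), (256, 0.9)],
}
print(allocate_budgets(curves, 256))  # {'h0': 64, 'h1': 192}
```

In this toy example the head whose utility saturates early (h0) receives a small budget, while the head that keeps gaining from a longer context (h1) receives most of it. Because each envelope has decreasing slopes, taking segments in order of slope solves the relaxed allocation problem; how the paper defines utility and rounds the relaxed solution is not stated in the abstract.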
