2602.08005v1 Feb 08, 2026 cs.CL

DeltaKV: A Residual-Based KV Cache Compression Technique Leveraging Long-Range Similarity

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao

Citations: 87

h-index: 5

Yaowei Wang

Citations: 52

h-index: 4

Min Zhang

Citations: 45

h-index: 3

Jun Yu

Citations: 65

h-index: 4

Qiang Huang

Harbin Institute of Technology (Shenzhen)

Citations: 792

h-index: 14

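Using efficient long-context LLMs in applications such as autonomous agents, long-horizon reasoning, and creative writing faces a fundamental bottleneck: KV cache memory grows linearly. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. This paper proposes DeltaKV, a residual-based KV cache compression framework built on two empirical findings: similarity between long-range tokens and the presence of highly shared latent components in KV representations. Rather than dropping tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, substantially reducing storage while preserving accuracy. To carry the compression gains over into real system performance, the authors additionally developed Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse, irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of its original size while maintaining near-lossless accuracy on the LongBench, SCBench, and AIME datasets. When integrated with Sparse-vLLM, DeltaKV achieves up to a 2× throughput improvement over vLLM in long-context settings, offering a practical approach to scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse-vLLM.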
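To make the linear-growth bottleneck concrete, the following is a back-of-the-envelope sketch of KV cache size as a function of sequence length. The model shape (`num_layers`, `num_kv_heads`, `head_dim`, 2-byte values) is a hypothetical configuration chosen for illustration and does not come from the paper; only the 29% figure is taken from the abstract.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical model shape, for illustration only (not from the paper).
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for seq_len in (32_000, 64_000, 128_000):
    full = kv_cache_bytes(seq_len, **config)
    compressed = full * 0.29  # the 29% ratio reported in the abstract
    print(f"{seq_len:>7} tokens: {full / 2**30:.2f} GiB -> {compressed / 2**30:.2f} GiB")
```

Doubling the sequence length doubles the cache, which is why a fixed compression ratio matters most for the longest contexts.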

Original Abstract

The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse-vLLM, it achieves up to 2× throughput improvement over vLLM in long-context scenarios, demonstrating a practical path toward scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse-vLLM.
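The general idea of storing a residual against a retrieved reference, rather than the full vector, can be illustrated with a toy sketch. This is not DeltaKV's actual algorithm: the abstract does not specify how references are retrieved or how residuals are encoded, so the cosine-similarity lookup, the uniform quantizer, and every function name below are assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compress(vectors, num_refs, step=0.05):
    """Keep the first num_refs vectors in full; store the rest as residuals.

    Assumption: references are simply the earliest vectors, and residuals are
    encoded with a uniform quantizer of width `step` (a stand-in encoder).
    """
    refs = [list(v) for v in vectors[:num_refs]]
    encoded = []
    for v in vectors[num_refs:]:
        # Retrieve the most similar historical reference.
        idx = max(range(len(refs)), key=lambda i: cosine(v, refs[i]))
        residual = [x - r for x, r in zip(v, refs[idx])]
        # Small residuals map to small integer codes, which are cheap to store.
        codes = [round(x / step) for x in residual]
        encoded.append((idx, codes))
    return refs, encoded


def reconstruct(refs, encoded, step=0.05):
    """Rebuild approximate vectors from reference index plus decoded residual."""
    return [
        [r + c * step for r, c in zip(refs[idx], codes)]
        for idx, codes in encoded
    ]


# Toy data: two reference vectors and two later vectors close to them.
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.98, 0.03, 0.01],
    [0.02, 1.04, -0.03],
]
refs, encoded = compress(vectors, num_refs=2)
restored = reconstruct(refs, encoded)
```

Because each later vector lies close to its reference, the residual is small and the reconstruction error is bounded by half the quantization step. No token is dropped, which is the contrast with eviction-based methods that the abstract draws.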

4 Citations

1 Influential

47.215256339173 Altmetric

242.1 Score
