2602.08686v1 Feb 09, 2026 cs.LG

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

Chengzhi Wang (Citations: 108, h-index: 5)

Yibo Liu (Citations: 24, h-index: 2)

Baoliang Tian (Citations: 25, h-index: 3)

Haijun Zhang (Citations: 74, h-index: 3)

Ning Yang (Citations: 28, h-index: 2)

Original Abstract

Large Language Models (LLMs) in long-context scenarios are severely constrained by the linear growth of Key-Value (KV) cache memory. Existing KV compression methods rely either on static thresholds and attention-only heuristics or on coarse memory budget allocation. Under tight memory budgets, these methods overlook two key factors: prompt-dependent variation in compression risk and functional heterogeneity across attention heads. Ignoring these factors destabilizes token selection and leads to tail failures. To address these challenges, we propose CompilerKV, a risk-adaptive and head-aware compression framework that compiles offline experience into reusable decision tables for prefill-only deployment. CompilerKV integrates two synergistic components: (i) a Head Heterogeneity Table, learned via offline contextual bandits, which assigns head-specific reliability weights to explicitly govern functional differences across attention heads; and (ii) a Risk-Adaptive Threshold Gating mechanism that jointly models attention entropy and local perplexity, transforming prompt-level risk into deployable retention thresholds. Experiments on LongBench show that CompilerKV outperforms SOTA methods under a 512-token budget, recovering 97.7% of FullKV performance while achieving a gain of up to 5.2 points over the strongest competitor.
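The abstract names the two components but gives no formulas, so the following is only a minimal sketch of how such a pipeline could fit together, not the authors' method. Everything in it is an assumption made for illustration: the function names, the logistic gate in `risk_to_threshold`, its `lo`, `hi`, and `offset` parameters, and the use of a cumulative score-mass cutoff as the "retention threshold". The `head_weights` argument stands in for one row of a Head Heterogeneity Table, which in the paper is learned offline.

```python
import math


def attention_entropy(probs):
    """Shannon entropy (in nats) of a single attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def risk_to_threshold(entropy, perplexity, lo=0.80, hi=0.99, offset=3.0):
    """Map two prompt-level risk signals to a retention threshold.

    Hypothetical gate: a logistic squash of (entropy + log perplexity),
    interpolated between a lenient (lo) and a strict (hi) threshold.
    Assumes perplexity > 0.
    """
    risk = 1.0 / (1.0 + math.exp(-(entropy + math.log(perplexity) - offset)))
    return lo + (hi - lo) * risk


def select_tokens(head_attn, head_weights, threshold, budget):
    """Pick token indices to keep in the KV cache.

    head_attn    : one attention distribution per head (each sums to 1).
    head_weights : per-head reliability weights (stand-in for a learned table).
    threshold    : stop once the kept tokens cover this much score mass.
    budget       : hard cap on the number of kept tokens.
    """
    n_tokens = len(head_attn[0])
    total_w = sum(head_weights)
    # Reliability-weighted average of per-head attention for each token.
    scores = [
        sum(w * attn[t] for w, attn in zip(head_weights, head_attn)) / total_w
        for t in range(n_tokens)
    ]
    order = sorted(range(n_tokens), key=lambda t: scores[t], reverse=True)
    kept, mass = [], 0.0
    for t in order:
        if len(kept) >= budget or mass >= threshold:
            break
        kept.append(t)
        mass += scores[t]
    return sorted(kept)
```

In this sketch a riskier prompt (higher entropy or perplexity) yields a higher threshold, so more tokens are retained before the cutoff, while the budget still bounds the cache size. For example, `select_tokens(head_attn, head_weights, risk_to_threshold(h, ppl), budget=512)` would mirror the 512-token setting mentioned in the abstract, with `h` and `ppl` computed from the prompt.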
