2601.16986v1 Jan 05, 2026 cs.CL

Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle

Cheng Li (Citations: 1,480; h-index: 5)
Zihan Wang (Citations: 12; h-index: 1)
Cheng Tang (Citations: 4; h-index: 1)
Lei Gong (Citations: 1,036; h-index: 14)
Chao Wang (Citations: 46; h-index: 3)
Teng Wang (Citations: 6; h-index: 1)
Wenqi Lou (Citations: 132; h-index: 7)
Xuehai Zhou (Citations: 80; h-index: 5)

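Chain-of-Thought (CoT) reasoning in large language models (LLMs) substantially improves accuracy on complex tasks, but the long think-stage sequences stored in the KV cache incur excessive memory overhead. In conventional generation tasks all tokens are equally important, whereas CoT centers on the final answer, which makes existing KV compression strategies ineffective. This paper presents Crystal-KV, an efficient KV cache management framework specialized for CoT reasoning. The core idea is the "answer-first" principle. By mapping answer preferences onto the think-stage attention map, it distinguishes SlipKV (entries that mainly maintain the reasoning flow but may occasionally include misleading context) from CrystalKV (entries that actually contribute to the correctness of the final answer). It also proposes an attention-based Least Recently Frequently Used (LRFU) algorithm, which precisely identifies when a SlipKV entry's usefulness has expired and evicts it, retaining CrystalKV without disrupting the reasoning flow. Finally, it introduces an adaptive cache budget allocation algorithm that estimates the importance of each layer/head from the dynamic proportion of CrystalKV and adjusts the KV cache budget during inference. Experimental results show that Crystal-KV achieves state-of-the-art KV cache compression, substantially improves throughput, and shortens response time while maintaining or even improving answer accuracy for CoT reasoning.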
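
The attention-based LRFU idea can be illustrated with a small sketch. The abstract does not give the scoring rule, so everything here is an assumption: the class name `AttentionLRFUCache`, the `decay` parameter, and the update rule, which borrows the classic LRFU combined recency/frequency score and substitutes the attention weight a token receives for the usual "+1 per reference". It is not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AttentionLRFUCache:
    """Toy KV-eviction policy: LRFU-style scores driven by attention weights.

    Assumption: each decoding step supplies one attention weight per cached
    token position. The paper's real scoring rule may differ.
    """

    budget: int          # maximum number of cached token positions
    decay: float = 1.0   # hypothetical lambda: 0 counts frequency only, larger favors recency
    scores: dict[int, float] = field(default_factory=dict)  # position -> score

    def step(self, new_pos: int, attention: dict[int, float]) -> list[int]:
        """Age every score by one step, add this step's attention, evict overflow."""
        factor = 0.5 ** self.decay
        for pos in self.scores:
            self.scores[pos] = self.scores[pos] * factor + attention.get(pos, 0.0)
        # The new token starts with a full score so it is not evicted at once.
        self.scores[new_pos] = 1.0
        evicted = []
        while len(self.scores) > self.budget:
            victim = min(self.scores, key=self.scores.get)  # lowest score leaves
            del self.scores[victim]
            evicted.append(victim)
        return evicted


cache = AttentionLRFUCache(budget=2, decay=1.0)
cache.step(0, {})
cache.step(1, {0: 0.9})
evicted = cache.step(2, {0: 0.8, 1: 0.05})
print(evicted)  # [1]
```

In this toy run, position 1 is newer than position 0 but receives almost no attention, so it is the one evicted. That is the behavior the summary attributes to expired SlipKV entries, under the assumed scoring rule.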

Original Abstract

Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks, yet incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache. Unlike traditional generation tasks where all tokens are uniformly important, CoT emphasizes the final answer, rendering conventional KV compression strategies ineffective. In this paper, we present Crystal-KV, an efficient KV cache management framework tailored for CoT reasoning. Our key insight is the answer-first principle. By mapping answer preferences into the think-stage attention map, we distinguish between SlipKV, which mainly maintains the reasoning flow but may occasionally introduce misleading context, and CrystalKV, which truly contributes to the correctness of the final answer. Next, we propose an attention-based Least Recently Frequently Used (LRFU) algorithm. It precisely identifies when a SlipKV entry's utility expires and evicts it, retaining CrystalKV without disrupting reasoning flow. Finally, we introduce an adaptive cache budget allocation algorithm. Based on the dynamic proportion of CrystalKV, it estimates the importance of each layer/head and adjusts the KV cache budget during inference, amplifying critical components to improve budget utilization. Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and enables faster response time, while maintaining, or even improving, answer accuracy for CoT reasoning.
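
The adaptive budget allocation step can be sketched as follows. The abstract only says that per-layer/head importance is estimated from the dynamic proportion of CrystalKV. The function name `allocate_budgets`, the `floor` parameter, and the choice of a simple proportional split with largest-remainder rounding are assumptions for illustration, not the paper's algorithm.

```python
def allocate_budgets(crystal_ratio: list[float], total_budget: int, floor: int = 1) -> list[int]:
    """Split total_budget across heads in proportion to their CrystalKV ratio.

    Assumptions: crystal_ratio[i] is the measured share of CrystalKV entries
    in head i, and every head keeps at least `floor` cache slots.
    """
    n = len(crystal_ratio)
    if total_budget < n * floor:
        raise ValueError("total_budget is too small for the per-head floor")
    spare = total_budget - n * floor
    total = sum(crystal_ratio)
    if total <= 0:
        raw = [spare / n] * n  # no signal yet: fall back to an even split
    else:
        raw = [spare * r / total for r in crystal_ratio]
    budgets = [floor + int(x) for x in raw]
    # Hand out slots lost to truncation, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


print(allocate_budgets([0.5, 0.25, 0.25], total_budget=103))  # [51, 26, 26]
```

Calling the function again whenever the measured ratios change would approximate adjusting the budget during inference. How often Crystal-KV re-estimates them is not stated in the abstract.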

Citations: 1; Influential: 0; Altmetric: 7; Score: 36.0
