2606.09508v1 Jun 08, 2026 cs.AI

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Qing Li
Haoyang Li
Fei Teng
The Hong Kong University of Science and Technology
Zhanchao Xu
Q. Xiao
Chen Jason Zhang
Lei Chen

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these two types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on the Llama, Qwen, and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39$\times$ end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is available at https://github.com/SHA-4096/EntropyInfer.
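
The Rigid/Dynamic distinction in the abstract can be illustrated with a small sketch. It is not the EntropyInfer implementation: it assumes the entropy in question is the Shannon entropy of a head's attention distribution, measured once per input segment, and it uses a made-up threshold of 0.1 nats to decide what counts as "near zero". The function names, the threshold, and the toy attention rows are all hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores; they are normalized here
    so the function also accepts unnormalized rows.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_head(segment_entropies, threshold=0.1):
    """Label a head from its per-segment entropies.

    Assumption: a head is 'rigid' when its entropy stays below `threshold`
    on every segment, and 'dynamic' otherwise. The threshold value is
    illustrative and not taken from the paper.
    """
    return "rigid" if max(segment_entropies) < threshold else "dynamic"


# Toy attention rows for two heads over three input segments (made-up values).
head_a = [
    [0.999, 0.001, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.998, 0.0, 0.002, 0.0],
]
head_b = [
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.4, 0.3, 0.2, 0.1],
]

labels = {}
for name, segments in {"head_a": head_a, "head_b": head_b}.items():
    entropies = [attention_entropy(row) for row in segments]
    labels[name] = classify_head(entropies)
    print(name, [round(h, 3) for h in entropies], labels[name])
```

Because the labels are recomputed from the attention rows of the current input, the same head can come out rigid on one context and dynamic on another, which is the context dependence the abstract says rules out a fixed offline assignment.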
