2606.11164v1 Jun 09, 2026 cs.AI

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Shuang Qiu
Peisong Wang
Yunhe Li
Hanxu Hou
WeiZhi Fei
Tsinghua University
Wenhao Liu
Haomin Shi
Xiangyu Wang
Mengzhe Ruan
Linqi Song

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern, which we call the "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.
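
The two-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how layer demand or head utility are computed, so `layer_weights` (standing in for an offline layer profile) and `head_utilities` (standing in for online per-head scores) are hypothetical inputs, and proportional splitting with largest-remainder rounding is an assumed allocation rule chosen only for illustration. The sketch also assumes every layer has the same number of heads.

```python
from typing import List, Sequence


def split_budget(total: int, weights: Sequence[float], floor: int = 1) -> List[int]:
    """Split `total` tokens proportionally to `weights`, giving each item at least `floor`.

    Uses largest-remainder rounding so the integer shares sum exactly to `total`.
    This rule is an illustrative assumption, not the paper's allocation method.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("weights must be non-empty")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if total < floor * n:
        raise ValueError("total budget is too small for the per-item floor")

    spare = total - floor * n  # tokens left after every item gets its floor
    weight_sum = sum(weights)
    if weight_sum > 0:
        ideal = [spare * w / weight_sum for w in weights]
    else:
        ideal = [spare / n] * n  # no signal: fall back to a uniform split

    shares = [int(x) for x in ideal]  # round down first
    leftover = spare - sum(shares)
    # Hand the remaining tokens to the items with the largest fractional parts.
    by_remainder = sorted(range(n), key=lambda i: ideal[i] - shares[i], reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return [floor + s for s in shares]


def allocate_budgets(
    total_budget: int,
    layer_weights: Sequence[float],          # hypothetical offline layer profile
    head_utilities: Sequence[Sequence[float]],  # hypothetical online head scores
    min_per_head: int = 1,
) -> List[List[int]]:
    """Return budgets[layer][head]: layer-level split first, then head-level split."""
    if len(layer_weights) != len(head_utilities):
        raise ValueError("need one utility list per layer")
    num_heads = len(head_utilities[0])
    if any(len(u) != num_heads for u in head_utilities):
        raise ValueError("this sketch assumes the same head count in every layer")

    # Level 1: fixed per-layer budgets, with room for each head's minimum.
    layer_budgets = split_budget(total_budget, layer_weights, floor=min_per_head * num_heads)
    # Level 2: within each layer, shift budget toward higher-utility heads.
    return [
        split_budget(budget, utilities, floor=min_per_head)
        for budget, utilities in zip(layer_budgets, head_utilities)
    ]
```

For example, `allocate_budgets(32, [1.0, 3.0], [[1.0, 1.0], [0.0, 3.0]], min_per_head=2)` returns `[[5, 5], [2, 20]]`: the second layer receives the larger share, and within it the zero-utility head keeps only its minimum. In a decoding loop the head-level split would be recomputed as utilities change, while the layer-level split stays fixed; the resulting per-head budgets would then be passed to whichever token-eviction policy is in use.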

