2601.07667v1 Jan 12, 2026 cs.CL

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

Chuan Xiao

Citations: 77

h-index: 3

Rei Taniguchi

Citations: 22

h-index: 2

Yuyang Dong

Citations: 378

h-index: 8

Makoto Onizuka

Citations: 343

h-index: 6

Original Abstract

Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre-defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
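The abstract does not spell out how the selection layer is chosen, so the following is only a toy sketch of the general idea and not the authors' algorithm. It assumes per-layer attention scores are available as one number per prompt token, and it uses a hypothetical stability rule: select at the first layer where the mean variance of token ranks over a short window of layers falls to a threshold or below. The function names, the `window` and `threshold` parameters, and the sample scores are all illustrative assumptions.

```python
from statistics import pvariance


def token_ranks(scores):
    """Return each token's rank by attention score (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def choose_selection_layer(layer_scores, window=2, threshold=1.0):
    """Pick the first layer whose token ranks look stable.

    Hypothetical criterion (an assumption, not taken from the paper):
    the per-token variance of ranks over the last `window` layers,
    averaged over tokens, is at most `threshold`. Falls back to the
    last layer if no layer qualifies.
    """
    ranks = [token_ranks(s) for s in layer_scores]
    n_tokens = len(layer_scores[0])
    for layer in range(window - 1, len(ranks)):
        recent = ranks[layer - window + 1 : layer + 1]
        mean_var = sum(
            pvariance([r[t] for r in recent]) for t in range(n_tokens)
        ) / n_tokens
        if mean_var <= threshold:
            return layer
    return len(ranks) - 1


def select_tokens(scores, budget):
    """One-shot selection: keep the `budget` highest-scoring token indices."""
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:budget]
    return sorted(top)


# Made-up attention scores: 4 layers x 4 prompt tokens.
layer_scores = [
    [0.10, 0.40, 0.20, 0.30],
    [0.40, 0.10, 0.30, 0.20],
    [0.50, 0.15, 0.25, 0.10],
    [0.60, 0.12, 0.20, 0.08],
]

layer = choose_selection_layer(layer_scores, window=2, threshold=0.2)
kept = select_tokens(layer_scores[layer], budget=2)
print(layer, kept)  # prints: 2 [0, 2]
```

In this sketch the token indices in `kept` would be the only ones whose keys and values are retained from the chosen layer onward, which mirrors the one-shot selection described in the abstract. The `budget` argument stands in for the user-specified KV budget.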

2 Citations

0 Influential

4 Altmetric

22.0 Score
