2604.05887v1 Apr 07, 2026 cs.AI

HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

Xiao Gu

Citations: 2

h-index: 1

Fei Ren

Citations: 1,894

h-index: 5

Ke Chen

Citations: 120

h-index: 6

Bowen Zeng

Citations: 5

h-index: 2

Lidan Shou

Citations: 18

h-index: 1

Huan Li

Citations: 84

h-index: 6

Jun Zhang

Citations: 55

h-index: 2

Abstract

Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, so caches scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level methods uniformly discard less important tokens, layer-level methods vary retention across layers, and head-level methods redistribute budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads, which require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to $7.9\times$ and achieves $1.52\times$ faster decoding, with performance nearly matching, and in some cases exceeding, that of the full-cache MLLM.
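To make the linear growth of the KV cache concrete, the following is a minimal back-of-the-envelope sketch, not the paper's implementation. It uses the standard size estimate for a transformer KV cache (keys plus values, per layer, per KV head). The function name `kv_cache_bytes` and the configuration values (28 layers, 4 KV heads, head dimension 128, 16-bit values) are illustrative assumptions for a 7B-class model with grouped-query attention and are not taken from the abstract; only the $7.9\times$ factor comes from the reported results, and it is an upper bound ("up to").

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the bytes needed to cache keys and values for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration for illustration only (an assumption, not from the paper).
config = dict(num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_value=2)

# Best-case memory reduction reported in the abstract ("up to 7.9x").
REPORTED_MAX_REDUCTION = 7.9

for seq_len in (4_096, 32_768, 131_072):
    full = kv_cache_bytes(seq_len, **config)
    compressed = full / REPORTED_MAX_REDUCTION
    print(
        f"{seq_len:>7} tokens: full cache {full / 2**20:8.1f} MiB, "
        f"best-case compressed {compressed / 2**20:7.1f} MiB"
    )
```

Because every factor other than `seq_len` is fixed by the model, doubling the number of cached tokens doubles the memory, which is why a single image or video that expands into thousands of visual tokens dominates the cache.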

2 Citations

0 Influential

3 Altmetric

17.0 Score
