2602.12618v1 Feb 13, 2026 cs.CV

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Rui Mao

Omer Faruk Deniz

Ruochen Li

University of Texas at Dallas

Yapeng Tian

Latifur Khan

Abstract

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, which limits generality because encoder-projector designs vary, or within the LLM, using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
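The idea of uniform downsampling at selected layers can be illustrated with a small sketch. This is not the authors' implementation: the function names, the layer schedule, and the strides below are hypothetical, and the token and layer counts are illustrative inputs only. The sketch shows that selecting tokens by position needs no attention scores, and it tracks how many vision tokens would enter each layer under an assumed schedule.

```python
from typing import Dict, List


def uniform_downsample(indices: List[int], stride: int) -> List[int]:
    """Keep every `stride`-th vision token position (no importance scores)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return indices[::stride]


def vision_tokens_per_layer(num_vision: int, num_layers: int,
                            schedule: Dict[int, int]) -> List[int]:
    """Return how many vision tokens enter each layer.

    `schedule` maps a layer index to the stride applied to the surviving
    vision tokens just before that layer. The values used below are
    assumptions for illustration, not the paper's settings.
    """
    kept = list(range(num_vision))
    counts = []
    for layer in range(num_layers):
        if layer in schedule:
            kept = uniform_downsample(kept, schedule[layer])
        counts.append(len(kept))
    return counts


# Illustrative numbers only: 576 vision tokens, 32 layers, and a
# hypothetical schedule that halves the vision tokens before
# layers 8, 16, and 24.
counts = vision_tokens_per_layer(576, 32, {8: 2, 16: 2, 24: 2})

# Fraction of (vision token, layer) pairs still processed. This is a
# crude proxy for vision-token work, not the FLOPs figure reported above.
token_layer_fraction = sum(counts) / (576 * 32)
```

Because the kept positions depend only on the stride, a step like this would not need access to attention weights, which is consistent with the abstract's claim of FlashAttention compatibility. How the model learns to pack information into the surviving tokens is outside the scope of this sketch.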
