2603.05950v1 Mar 06, 2026 cs.CV

Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

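Reducing the number of visual tokens is critical for accelerating Vision-Language Models (VLMs), yet most existing methods apply a fixed budget to every input and overlook the substantial variation in image information density. This paper proposes E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a given proportion of spectral energy, E-AdaPrune allocates more tokens to information-dense scenes and aggressively compresses redundant ones, without introducing additional learnable parameters. Evaluated on nine benchmarks and three VLM backbones (LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B), E-AdaPrune improves average performance by up to 0.6% under matched average token budgets, including a +5.1% relative gain on the MMVet reasoning task. With randomized singular value decomposition, the added latency is limited to 8 ms per image.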

Original Abstract

Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
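
The energy criterion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "spectral energy" means the sum of squared singular values of a tokens-by-channels feature matrix, and that the budget is the smallest rank whose leading singular values retain a target fraction of that energy. The function name `energy_token_budget`, the parameter `tau`, and the toy matrix shapes are hypothetical. The sketch uses an exact SVD for simplicity, whereas the paper reports using randomized SVD to limit latency. How tokens are selected once the budget is known is not stated in the abstract and is not shown here.

```python
import numpy as np


def energy_token_budget(features: np.ndarray, tau: float = 0.9) -> int:
    """Return the smallest k whose top-k singular values keep a fraction
    `tau` of the spectral energy (assumed here: sum of squared singular values).

    `features` is a (num_tokens, dim) matrix of visual token features.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")

    # Singular values only, returned in descending order.
    s = np.linalg.svd(features, compute_uv=False)
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:
        return 1  # degenerate all-zero input: keep a single token

    cumulative = np.cumsum(energy) / total
    # First index where the retained energy reaches tau, converted to a count.
    k = int(np.searchsorted(cumulative, tau)) + 1
    return min(k, len(s))  # guard against floating-point shortfall at tau = 1


rng = np.random.default_rng(0)

# Toy "redundant" image: every token carries the same feature vector (rank 1).
redundant = np.outer(np.ones(576), rng.standard_normal(64))

# Toy "information-dense" image: tokens are independent random vectors.
dense = rng.standard_normal((576, 64))

k_redundant = energy_token_budget(redundant, tau=0.9)
k_dense = energy_token_budget(dense, tau=0.9)
print(k_redundant, k_dense)
```

Under these assumptions, the redundant input receives a much smaller budget than the dense one, which mirrors the adaptive behavior the abstract describes: the budget follows the spectrum of each image rather than a fixed count.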
