2602.13315v1 Feb 10, 2026 cs.CV

IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs

Yifan Tan (Citations: 34, h-index: 3)
Yifu Sun (Citations: 529, h-index: 6)
Shirui Huang (Citations: 135, h-index: 4)
Hong Liu (Citations: 303, h-index: 2)
Guanghua Yu (Citations: 50, h-index: 4)
Jianchen Zhu (Citations: 13, h-index: 3)
Yangdong Deng (Citations: 5, h-index: 1)

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the Importance and Diversity Pruner (IDPruner), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% even under an extreme 90% pruning ratio. Our code is available at https://github.com/Tencent/AngelSlim.
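To make the MMR idea in the abstract concrete, the following is a minimal sketch of greedy Maximal Marginal Relevance selection over visual tokens. It is not the authors' implementation: the function name `mmr_select`, the trade-off weight `lam`, the use of cosine similarity as the redundancy measure, and the assumption that a per-token importance score is already available are all illustrative choices, since the abstract does not specify them.

```python
import numpy as np

def mmr_select(features, importance, keep, lam=0.5):
    """Greedy MMR sketch: choose `keep` token indices that trade off
    importance against redundancy with already-selected tokens.

    features:   (n, d) array of token embeddings (assumed input)
    importance: (n,) array of per-token importance scores (assumed input)
    lam:        hypothetical weight; 1.0 = importance only, 0.0 = diversity only
    """
    features = np.asarray(features, dtype=float)
    importance = np.asarray(importance, dtype=float)
    n = features.shape[0]
    keep = min(keep, n)
    if keep <= 0:
        return []

    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    feats = features / np.maximum(norms, 1e-12)
    sim = feats @ feats.T

    # Start from the single most important token.
    first = int(np.argmax(importance))
    selected = [first]
    remaining = np.ones(n, dtype=bool)
    remaining[first] = False
    # max_sim[i] = highest similarity between token i and any selected token.
    max_sim = sim[first].copy()

    while len(selected) < keep:
        # MMR score: reward importance, penalize similarity to the kept set.
        score = lam * importance - (1.0 - lam) * max_sim
        score[~remaining] = -np.inf
        nxt = int(np.argmax(score))
        selected.append(nxt)
        remaining[nxt] = False
        max_sim = np.maximum(max_sim, sim[nxt])
    return selected

# Toy example: tokens 0 and 1 are identical, token 2 is distinct but less important.
toy_features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_importance = np.array([0.9, 0.8, 0.5])
print(mmr_select(toy_features, toy_importance, keep=2, lam=0.5))  # → [0, 2]
```

With `lam=1.0` the same call degenerates to plain top-k by importance and returns the two near-duplicate tokens, which illustrates why a diversity term can matter at high pruning ratios. Because the sketch needs only token features and scores, not attention maps, it is consistent in spirit with the FlashAttention compatibility the abstract claims, though how the paper actually derives importance is not stated here.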

