2602.18846v1 Feb 21, 2026 cs.CV

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

P. Brahma

Citations: 352

h-index: 7

Aditya Singh

Citations: 7

h-index: 1

Hitesh Kandala

Citations: 60

h-index: 3

Zicheng Liu

Citations: 16,207

h-index: 51

E. Barsoum

Citations: 12

h-index: 3

Abstract

Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet they remain computationally expensive because of dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. When this dual-stage compression is applied during training, it achieves 99.7% accuracy at 67% reduction and 97.6% at 89% reduction, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline, achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% reduction setting. These results highlight the strength of end-to-end training with DUET-VLM, which enables robust adaptation to reduced visual (image/video) input without sacrificing accuracy and produces compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
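The two stages described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not specify how redundancy or text saliency is measured, so this sketch assumes greedy cosine-similarity merging for stage (a) and a maximum dot-product score against text tokens for stage (b). All function names, the `threshold` parameter, and the per-layer `keep_ratios` schedule are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compress_redundant(tokens, threshold=0.95):
    """Stage (a), assumed form: vision-only greedy merging.

    Each token joins the first cluster whose running mean it matches
    with cosine similarity >= threshold; otherwise it starts a new
    cluster. The output is one mean vector per cluster.
    """
    clusters = []  # each entry: [sum_vector, member_count]
    for tok in tokens:
        for cluster in clusters:
            mean = [s / cluster[1] for s in cluster[0]]
            if cosine(mean, tok) >= threshold:
                cluster[0] = [s + t for s, t in zip(cluster[0], tok)]
                cluster[1] += 1
                break
        else:
            clusters.append([list(tok), 1])
    return [[s / count for s in total] for total, count in clusters]


def drop_by_text_saliency(visual, text, keep_ratio):
    """Stage (b), assumed form: keep the visual tokens most relevant to the text.

    A token's score is its largest dot product with any text token.
    The top ceil(len * keep_ratio) tokens survive, in original order.
    """
    scores = [max(sum(a * b for a, b in zip(v, t)) for t in text) for v in visual]
    k = max(1, math.ceil(len(visual) * keep_ratio))
    ranked = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    return [visual[i] for i in sorted(ranked[:k])]


def dual_stage(visual, text, threshold, keep_ratios):
    """Run stage (a) once, then stage (b) once per hypothetical pruning layer."""
    tokens = compress_redundant(visual, threshold)
    for ratio in keep_ratios:
        tokens = drop_by_text_saliency(tokens, text, ratio)
    return tokens


if __name__ == "__main__":
    visual_tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
    text_tokens = [[1.0, 0.0]]
    merged = compress_redundant(visual_tokens, threshold=0.99)
    pruned = dual_stage(visual_tokens, text_tokens, threshold=0.99, keep_ratios=[0.5])
    print(len(visual_tokens), len(merged), len(pruned))
```

For a sense of scale, assuming the 576 visual tokens per image that LLaVA-1.5 normally produces (a figure not stated in the abstract), the reported 67% and 89% reductions correspond to roughly 192 and 64 retained tokens.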

0 Citations

0 Influential

45.5 Altmetric

227.5 Score
