2602.02951v1 Feb 03, 2026 cs.CV

Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning

Yihong Huang (citations: 22, h-index: 3)

Fei Ma (citations: 28, h-index: 3)

Yihua Shao (citations: 185, h-index: 7)

Jing Guo (citations: 36, h-index: 3)

Zitong Yu (citations: 246, h-index: 4)

Laizhong Cui (citations: 41, h-index: 3)

Qi Tian (citations: 38, h-index: 3)

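Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). Existing pruning methods preserve performance well on visual question answering (VQA), but they degrade substantially on visual grounding (VG) tasks. An analysis of the VLM processing pipeline shows that strategies relying on global semantic similarity and attention scores lose the global spatial reference frame that arises from interactions among the tokens' positional information. Based on this finding, the authors propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. The first stage runs after the vision encoder and applies three operations (separation, alignment, and aggregation), inspired by swarm intelligence algorithms, to retain information-rich global spatial anchors. The second stage performs text-guided pruning inside the LLM to retain task-relevant visual tokens. Extensive experiments show that Nüwa achieves state-of-the-art performance on multiple VQA benchmarks (from 94% to 95%) and substantial improvements on visual grounding tasks (from 7% to 47%).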

Original Abstract

Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). However, existing pruning methods demonstrate excellent performance preservation in visual question answering (VQA) but suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms, to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
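
The abstract does not spell out how the second-stage, text-guided pruning scores tokens, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes cosine similarity to a mean-pooled text embedding as a stand-in relevance score; the function name `text_guided_prune` and its parameters are hypothetical. The one detail it illustrates from the text is that surviving tokens are returned in their original order, so their positional layout is not scrambled by the selection.

```python
import numpy as np


def text_guided_prune(visual_tokens, text_tokens, keep):
    """Keep the `keep` visual tokens most similar to the text query.

    ASSUMPTION: cosine similarity to the mean text embedding is used as
    the relevance score; the paper's actual scoring rule is not given
    in the abstract.

    visual_tokens: (N, D) array of visual token embeddings.
    text_tokens:   (T, D) array of text token embeddings.
    Returns (kept_tokens, kept_indices), with indices in ascending
    order so the kept tokens preserve their original sequence order.
    """
    # Mean-pool the text tokens into a single query vector.
    query = text_tokens.mean(axis=0)

    # Cosine similarity between every visual token and the query.
    v_norm = np.linalg.norm(visual_tokens, axis=1) + 1e-8
    q_norm = np.linalg.norm(query) + 1e-8
    scores = (visual_tokens @ query) / (v_norm * q_norm)

    # Pick the top-`keep` scores; the stable sort breaks ties toward
    # lower indices.
    top = np.argsort(-scores, kind="stable")[:keep]

    # Restore ascending index order before gathering.
    kept_indices = np.sort(top)
    return visual_tokens[kept_indices], kept_indices


# Toy example: four 2-D "visual tokens" and a text query along the x-axis.
visual = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
text = np.array([[1.0, 0.0], [1.0, 0.0]])
kept_tokens, kept_indices = text_guided_prune(visual, text, keep=2)
```

In this toy input, tokens 0 and 2 point most nearly in the direction of the text query, so they are the two that survive. A real implementation inside an LLM would more likely score tokens with text-to-vision attention weights at a chosen layer; that variant is likewise an assumption here.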

