2604.08120v1 Apr 09, 2026 cs.CV

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Lemeng Wu (Citations: 603, h-index: 6)
Zechun Liu (Citations: 1,928, h-index: 10)
Chong Zhou (Citations: 50, h-index: 4)
Raghuraman Krishnamoorthi (Citations: 4,880, h-index: 18)
Vikas Chandra (Citations: 1,902, h-index: 15)
Mohamed Elhoseiny (Citations: 26, h-index: 2)
Wei Wen (Citations: 18, h-index: 3)
Jun Chen (Citations: 104, h-index: 5)
Junjie Fei (Citations: 26, h-index: 4)
Junlin Han (Citations: 63, h-index: 4)
Mingchen Zhuge (Citations: 44, h-index: 2)
Saksham Suri (Citations: 42, h-index: 3)
Qiangbo Qian (Citations: 0, h-index: 0)
Yunyang Xiong (Citations: 2,032, h-index: 10)
Shuming Liu (Citations: 6, h-index: 1)
Chenchen Zhu (Citations: 663, h-index: 7)

Original Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
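The budgeting idea in the abstract (dense bandwidth for query-critical segments, minimal anchors elsewhere, a hard cap on total visual tokens) can be illustrated with a small sketch. This is not the paper's ATA: ATA is described as a causal, training-free O(1) router, whereas the code below is a simple offline proportional allocation with a per-frame floor and cap. The function name `allocate_tokens`, the relevance scores, the segment sizes, and the reading of "8K" as 8192 tokens are all assumptions made for illustration; only the 0.5-16 tokens/frame range comes from the abstract.

```python
def allocate_tokens(relevance, frames, budget, min_tpf=0.5, max_tpf=16.0):
    """Split a visual-token budget across video segments (illustrative only).

    relevance: hypothetical per-segment scores, higher = more query-relevant
    frames:    number of frames in each segment (assumed positive)
    budget:    total visual tokens allowed
    min_tpf / max_tpf: tokens-per-frame floor and cap (range from the abstract)
    Returns a list of per-segment token counts as floats.
    """
    if len(relevance) != len(frames):
        raise ValueError("relevance and frames must have the same length")

    # Every segment keeps a minimal "anchor" so the storyline stays covered.
    alloc = [min_tpf * f for f in frames]
    cap = [max_tpf * f for f in frames]
    remaining = budget - sum(alloc)
    if remaining < 0:
        raise ValueError("budget is too small for the minimal anchors")

    # Only segments with positive relevance compete for the extra tokens.
    active = [i for i, r in enumerate(relevance) if r > 0 and frames[i] > 0]

    # Hand out the remainder in proportion to relevance * frames, and
    # redistribute whatever capped segments could not absorb.
    while remaining > 1e-9 and active:
        total_w = sum(relevance[i] * frames[i] for i in active)
        spent = 0.0
        still_open = []
        for i in active:
            share = remaining * relevance[i] * frames[i] / total_w
            give = min(share, cap[i] - alloc[i])
            alloc[i] += give
            spent += give
            if cap[i] - alloc[i] > 1e-9:
                still_open.append(i)
        remaining -= spent
        if len(still_open) == len(active):
            break  # no segment hit its cap, so the remainder is fully spent
        active = still_open
    return alloc


# Hypothetical example: four 256-frame segments, one clearly relevant.
scores = [0.0, 0.9, 0.1, 0.0]
segment_frames = [256, 256, 256, 256]
alloc = allocate_tokens(scores, segment_frames, budget=8192)
tokens_per_frame = [a / f for a, f in zip(alloc, segment_frames)]
```

In this toy setting the irrelevant segments stay at the 0.5 tokens/frame floor while the most relevant segment saturates at 16 tokens/frame, and the total never exceeds the budget. A causal router such as the one the abstract describes would have to make each decision without seeing later segments, which this sketch does not attempt.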

