2602.15318v1 Feb 17, 2026 cs.CV

Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

Libo Zhang (Citations: 24, h-index: 3)
Zhaoning Zhang (Citations: 18, h-index: 3)
Wangyang Hong (Citations: 4, h-index: 1)
Dongsheng Li (Citations: 29, h-index: 3)
P. Qiao (Citations: 601, h-index: 13)

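Speculative decoding, a widely used technique for accelerating inference, is effective for Vision-Language Models (VLMs) but suffers a severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The cause is key-value cache explosion and context window mismatch, which lead to attention dilution and negative visual gain in the draft model. We observe that in Vid-LLMs, critical visual semantics are implicitly internalized into text hidden states during deep-layer interactions, which suggests that raw visual inputs can be structurally redundant during deep inference. To address this, we propose the Sparrow framework. Sparrow first uses visually-aware text-anchored window attention with hidden state reuse to fully offload visual computation to the target model, and it uses intermediate-layer visual state bridging to train the draft model on semantically rich intermediate states, filtering out low-level visual noise. It also introduces a multi-token prediction strategy to bridge the training-inference distribution shift. In experiments, Sparrow achieves an average speedup of 2.82x even with 25,000 visual tokens, effectively resolving the performance degradation on long sequences and offering a practical solution for real-time long-video tasks.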

Original Abstract

Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
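
For readers unfamiliar with the baseline technique the abstract builds on, the following is a minimal sketch of generic greedy speculative decoding: a cheap draft model proposes `k` tokens, and the target model keeps the longest prefix it agrees with, then supplies one token of its own. This is not Sparrow's method; the window attention, hidden state reuse, and multi-token prediction described above are not modeled here. All names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) are hypothetical, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed position. A real system would
    #    do this in a single batched forward pass; here it is a plain loop.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verification rounds."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prefix) + max_new], rounds


def toy_target(ctx: List[Token]) -> Token:
    # Toy "target model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Toy "draft model": agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, k=3, max_new=8)
    print(tokens, rounds)
```

Because the target token replaces the first mismatch, the output is identical to plain greedy decoding with the target model alone; the gain comes only from needing fewer target rounds than generated tokens. The abstract's point is that for Vid-LLMs the draft's acceptance rate collapses under long visual contexts, which erodes exactly this gain.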

3 Citations

0 Influential

6.5 Altmetric

35.5 Score
