2603.11896v1 Mar 12, 2026 cs.CV

Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Zhuoran Jin, Yubo Chen, Yupu Hao, Yulong Ao, Jun Zhao, Kang Liu, Lu Wang

Institute of Automation, Chinese Academy of Sciences, Beijing, China

The Key Laboratory of Cognition ... Decision Systems

Beijing Academy of Artificial Intelligence

Abstract

Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
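The abstract does not spell out how the segment-level streaming causal mask is built, so the following is only a minimal sketch of one plausible rule, not the authors' implementation. It assumes each token carries a hypothetical `arrival` index: the number of the latest video segment that had been received when the token entered the stream. Under that assumption, a query token may attend to any token from an earlier segment, and to tokens of its own segment only at the same or an earlier position. The function name `streaming_segment_mask` and the token layout are illustrative choices.

```python
from typing import List


def streaming_segment_mask(arrival: List[int]) -> List[List[bool]]:
    """Build a boolean attention mask; mask[i][j] is True when query i may attend to key j.

    arrival[k] is an assumed per-token label: the index of the latest video
    segment received when token k entered the stream (hypothetical layout,
    not taken from the paper).
    """
    n = len(arrival)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Everything that arrived in a strictly earlier segment is visible.
            earlier_segment = arrival[j] < arrival[i]
            # Inside the same segment, fall back to ordinary position-wise causality.
            same_segment_causal = arrival[j] == arrival[i] and j <= i
            mask[i][j] = earlier_segment or same_segment_causal
    return mask


if __name__ == "__main__":
    # Assumed training layout: two tokens of segment 0, two tokens of segment 1,
    # then a question/answer pair that was issued right after segment 0.
    arrival = [0, 0, 1, 1, 0, 0]
    for row in streaming_segment_mask(arrival):
        print("".join("x" if allowed else "." for allowed in row))
```

In this layout the question token at position 4 sees only the segment-0 tokens and itself, even though the segment-1 tokens sit earlier in the sequence, which is the kind of leak a plain lower-triangular mask would allow. When `arrival` is non-decreasing along the sequence, the rule reduces to the standard causal mask.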

