2603.12938v1 Mar 13, 2026 cs.CV

Thinking in Streaming Video: A Framework for Real-Time Video Understanding

Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu

Abstract

Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context has been observed. The resulting high latency and growing computational cost make them incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch-Think-Speak paradigm that lets a model incrementally update its understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model with a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream
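The Watch-Think-Speak loop and the memory replacement described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: `StreamingMemory`, `watch_think_speak`, `toy_policy`, the window size, and the 0.9 threshold are all hypothetical names and values chosen for illustration, and each "frame" is just a list of floats standing in for visual tokens. The sketch shows only the control flow: after every new observation the policy emits a short reasoning update, decides whether to answer, and old visual tokens are evicted while their text notes are kept.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Tuple

Tokens = List[float]  # stand-in for the visual tokens of one video chunk


@dataclass
class StreamingMemory:
    """Toy stand-in for RCSM: a short visual window plus compact text notes."""
    max_visual_chunks: int = 4  # hypothetical window size
    visual: Deque[Tuple[int, Tokens]] = field(default_factory=deque)
    notes: List[str] = field(default_factory=list)

    def add(self, step: int, tokens: Tokens, thought: str) -> None:
        self.visual.append((step, tokens))
        self.notes.append(f"t={step}: {thought}")
        # Evict the oldest visual tokens; their reasoning notes stay behind.
        while len(self.visual) > self.max_visual_chunks:
            self.visual.popleft()


# A policy maps (memory, new tokens, question) to (thought, answer or None).
Policy = Callable[[StreamingMemory, Tokens, str], Tuple[str, Optional[str]]]


def watch_think_speak(frames: List[Tokens], question: str,
                      policy: Policy, memory: StreamingMemory
                      ) -> List[Tuple[int, str]]:
    """Process frames one at a time and collect (step, answer) pairs."""
    responses = []
    for step, tokens in enumerate(frames):                  # watch
        thought, answer = policy(memory, tokens, question)  # think
        memory.add(step, tokens, thought)
        if answer is not None:                              # speak
            responses.append((step, answer))
    return responses


def toy_policy(memory: StreamingMemory, tokens: Tokens,
               question: str) -> Tuple[str, Optional[str]]:
    """Hypothetical rule: answer once any token exceeds a fixed threshold."""
    peak = max(tokens)
    thought = f"peak activation {peak:.2f}"
    answer = "event detected" if peak > 0.9 else None
    return thought, answer


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.95],
              [0.1, 0.1], [0.0, 0.2], [0.4, 0.3]]
    memory = StreamingMemory(max_visual_chunks=2)
    print(watch_think_speak(frames, "When does the event happen?",
                            toy_policy, memory))
    print(len(memory.visual), "visual chunks kept;", len(memory.notes), "notes")
```

In this sketch the visual window stays at a fixed size, while the notes list grows by one short string per step. A real system would replace `toy_policy` with a video-language model and would need its own rule for which visual tokens to evict.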
