2603.27593v1 Mar 29, 2026 cs.CV

STRIDE: Combining Sequence Denoising with Well-Timed Responses for Real-Time Video Understanding

STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Yonghyun Ro (Citations: 367, h-index: 11)

J. Rehg (Citations: 1,468, h-index: 21)

Junho Kim (Citations: 131, h-index: 7)

Ho-Jun Lee (Citations: 25, h-index: 2)

Minsu Kim (Citations: 939, h-index: 18)

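Recent advances in video large language models (Video-LLMs) have enabled strong offline reasoning over long, complex videos. In real-world settings, however, real-time perception and proactive interaction are becoming increasingly important: video frames arrive as a stream, and the system must decide not only what to respond but also when to respond. In this work, we reframe proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly within a sliding window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which applies a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals. Extensive experiments across diverse streaming benchmarks and downstream models show that STRIDE yields more reliable and temporally coherent proactive responses and significantly improves the quality of when-to-speak decisions in online streaming scenarios.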

Original Abstract

Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
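The abstract describes the mechanism only at a high level, so the following is a minimal toy sketch of the general idea (jointly decoding binary activation states over a sliding window by iterative unmasking), not the authors' implementation. Everything named here is an assumption made for illustration: `toy_predictor` is a hand-written stand-in for the learned masked diffusion module, the per-frame `scores` are hypothetical relevance values in [0, 1], and the 0.7/0.3 mixing weights, window size, and step count are arbitrary.

```python
from collections import deque

MASK = None  # marks an activation state that has not been decided yet


def toy_predictor(scores, states):
    """Hypothetical stand-in for a learned denoiser (NOT the STRIDE module).

    Returns P(active) for every position, mixing the frame's own score with
    the already-decided states of its immediate neighbours.
    """
    probs = []
    for i, score in enumerate(scores):
        neighbours = [states[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(states) and states[j] is not MASK]
        if neighbours:
            context = sum(neighbours) / len(neighbours)
            probs.append(0.7 * score + 0.3 * context)  # arbitrary weights
        else:
            probs.append(score)
    return probs


def denoise_window(scores, predictor=toy_predictor, steps=3):
    """Start fully masked; at each step commit the most confident positions."""
    n = len(scores)
    states = [MASK] * n
    per_step = -(-n // steps)  # ceiling division: positions revealed per step
    while any(s is MASK for s in states):
        probs = predictor(scores, states)
        masked = [i for i in range(n) if states[i] is MASK]
        # Confidence = distance from the 0.5 decision boundary.
        masked.sort(key=lambda i: abs(probs[i] - 0.5), reverse=True)
        for i in masked[:per_step]:
            states[i] = 1 if probs[i] >= 0.5 else 0
    return states


class StreamingActivation:
    """Keeps the last `window` scores and re-decodes them on every new frame."""

    def __init__(self, window=6, steps=3):
        self.scores = deque(maxlen=window)  # oldest frame drops automatically
        self.steps = steps

    def push(self, score):
        self.scores.append(score)
        states = denoise_window(list(self.scores), steps=self.steps)
        return states[-1] == 1, states  # (respond now?, window states)


if __name__ == "__main__":
    stream = StreamingActivation(window=6, steps=3)
    for s in [0.05, 0.2, 0.85, 0.45, 0.95, 0.1]:
        speak, states = stream.push(s)
    print(speak, states)
```

With these toy scores, thresholding each frame independently at 0.5 would give `[0, 0, 1, 0, 1, 0]`, splitting one event into two fragments. The sketch instead decodes `[0, 0, 1, 1, 1, 0]`, because the ambiguous frame (score 0.45) is decided last, after both of its neighbours have been committed as active. This is only meant to illustrate why joint, iterative decoding can produce span-coherent activations; a learned module would replace the hand-written predictor.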

Citations: 1 | Influential: 0 | Altmetric: 10.5 | Score: 53.5
