2604.05418v1 Apr 07, 2026 cs.CV

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

Yiwei Wang (Citations: 179, h-index: 8)

Yujun Cai (Citations: 3,707, h-index: 20)

Honghao Fu, Nanyang Technological University (Citations: 366, h-index: 6)

Miao Xu (Citations: 2, h-index: 1)

Dailing Zhang (Citations: 135, h-index: 7)

Liu Jun (Citations: 0, h-index: 0)

Abstract

Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. Retrieval-augmented generation (RAG) is a promising remedy because it organizes query-relevant visual evidence into a compact context, but most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are only implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at the clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. It also introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.
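The abstract gives no implementation details, so the following is only a toy sketch of the general idea it names: a clip-level graph, multi-hop expansion from seed clips, and frame ranking by a relevance score. Every name here (`build_clip_graph`, `multi_hop_retrieve`, `top_frames`), the edge rule (temporal adjacency plus shared entity labels), and the hard-coded scores are assumptions made for illustration; they are not taken from the paper or its released code, and the lookup table merely stands in for an MLLM-backed scorer.

```python
from collections import deque


def build_clip_graph(entities_per_clip):
    """Hypothetical edge rule: link neighbouring clips and clips sharing an entity label."""
    n = len(entities_per_clip)
    graph = {i: set() for i in range(n)}
    for i in range(n - 1):  # temporal edges between consecutive clips
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    for i in range(n):  # "spatial" edges between clips with a common entity
        for j in range(i + 1, n):
            if entities_per_clip[i] & entities_per_clip[j]:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def multi_hop_retrieve(graph, seeds, max_hops):
    """Breadth-first expansion from seed clips; returns clip id -> hop distance."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbour in sorted(graph[node]):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def top_frames(candidates, score_fn, budget):
    """Keep the `budget` highest-scoring (clip, frame) pairs."""
    return sorted(candidates, key=score_fn, reverse=True)[:budget]


# Toy video: five clips, each tagged with made-up entity labels.
entities = [{"chef"}, {"knife"}, {"street"}, {"chef", "plate"}, {"car"}]
graph = build_clip_graph(entities)

# Start from clip 0 and allow one hop: reaches its temporal neighbour (1)
# and the distant clip that shares the "chef" entity (3).
reached = multi_hop_retrieve(graph, seeds=[0], max_hops=1)

# Stand-in for an intent-relevance scorer: a fixed lookup table of made-up scores.
toy_scores = {(3, 1): 0.9, (0, 0): 0.7, (1, 0): 0.2}
candidates = [(clip, frame) for clip in sorted(reached) for frame in range(2)]
picked = top_frames(candidates, lambda cf: toy_scores.get(cf, 0.0), budget=2)
print(reached, picked)
```

In this toy setup, the shared-entity edge is what lets a single hop reach clip 3, which retrieval over independent segments would treat as unrelated to clip 0.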

