2605.06185v1 May 07, 2026 cs.AI

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

Yu Zhao (Citations: 9, h-index: 2)
Liang Xie (Citations: 43, h-index: 4)
Erwei Yin (Citations: 1,535, h-index: 20)
P. Yan (Citations: 0, h-index: 0)
Juntong Qi (Citations: 33, h-index: 3)
Mingming Wang (Citations: 64, h-index: 3)

Abstract

Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
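The abstract names three components: State-Event-State (SES) units, a dual-store memory, and bidirectional retrieval. It gives no implementation details, so the following is only a minimal sketch of how such pieces could fit together, not the authors' method. Every name here (`SESEvent`, `DualStoreMemory`, `bidirectional_retrieve`, `max_hops`) is hypothetical, the embeddings are hand-written toy vectors rather than model outputs, and "bidirectional" is read as expanding from a semantically matched seed event both backward to its causes and forward to its effects, which is an assumption.

```python
from collections import deque
from dataclasses import dataclass
import math


@dataclass
class SESEvent:
    """One State-Event-State unit: state before, the event, state after."""
    event_id: str
    pre_state: str
    event: str
    post_state: str
    embedding: list  # toy stand-in for a learned semantic embedding


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DualStoreMemory:
    """Hypothetical dual store: an embedding table plus causal adjacency maps."""

    def __init__(self):
        self.events = {}     # event_id -> SESEvent (semantic store)
        self.causes = {}     # event_id -> ids of events it directly causes
        self.caused_by = {}  # event_id -> ids of events that directly cause it

    def add_event(self, ev):
        self.events[ev.event_id] = ev
        self.causes.setdefault(ev.event_id, set())
        self.caused_by.setdefault(ev.event_id, set())

    def add_causal_edge(self, cause_id, effect_id):
        self.causes[cause_id].add(effect_id)
        self.caused_by[effect_id].add(cause_id)

    def semantic_top_k(self, query_emb, k=1):
        """Rank stored events by similarity to the query embedding."""
        ranked = sorted(self.events.values(),
                        key=lambda ev: cosine(query_emb, ev.embedding),
                        reverse=True)
        return [ev.event_id for ev in ranked[:k]]

    def _walk(self, start, edges, max_hops):
        """Breadth-first walk along one edge direction, up to max_hops."""
        seen, order = {start}, []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for nxt in sorted(edges[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append((nxt, depth + 1))
        return order

    def bidirectional_retrieve(self, query_emb, k=1, max_hops=2):
        """Seed by similarity, then expand toward causes and toward effects."""
        chains = []
        for seed in self.semantic_top_k(query_emb, k):
            backward = self._walk(seed, self.caused_by, max_hops)
            forward = self._walk(seed, self.causes, max_hops)
            chains.append(list(reversed(backward)) + [seed] + forward)
        return chains


# Toy usage: three events linked into one causal chain.
memory = DualStoreMemory()
memory.add_event(SESEvent("e1", "stove off", "cook turns on stove", "stove on", [1.0, 0.0, 0.0]))
memory.add_event(SESEvent("e2", "pan cold", "pan heats up", "pan hot", [0.0, 1.0, 0.0]))
memory.add_event(SESEvent("e3", "kitchen quiet", "smoke alarm sounds", "alarm ringing", [0.0, 0.0, 1.0]))
memory.add_causal_edge("e1", "e2")
memory.add_causal_edge("e2", "e3")

chains = memory.bidirectional_retrieve([0.1, 0.9, 0.0], k=1)
print(chains)  # [['e1', 'e2', 'e3']]
```

In this toy setup the query vector is closest to `e2`, so retrieval returns the whole chain around it rather than the single matched event. That is the behavior the abstract contrasts with clip-level retrieval. A real system would presumably replace the linear scan with an approximate nearest-neighbor index and attach video evidence to each event, neither of which is modeled here.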

