2606.13141v1 Jun 11, 2026 cs.AI

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Nicole Hee-Yeon Kim
F. Porikli
Jisu Shin
Yuho Lee
Jihwan Bang
Juntae Lee
Kyuwoong Hwang
Hwanjun Song

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
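The chunk-level selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the `chunk_adaptive_select` function, the configuration labels (`"caption@30s"`, `"frames@10s"`), and the identity reranker are all hypothetical names and values chosen for illustration, and the retrievers are called one after another here rather than in parallel. The sketch only shows the idea that each chunk keeps the configuration under which it scores highest, so the evidence passed to the generator can mix configurations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    chunk_id: int   # position of the chunk in the video timeline
    config: str     # hypothetical modality-granularity label
    score: float    # relevance score for the query


# Hypothetical interfaces: a retriever maps a query to candidates,
# a reranker maps (query, candidate) to a comparable score.
Retriever = Callable[[str], List[Candidate]]
Reranker = Callable[[str, Candidate], float]


def chunk_adaptive_select(
    query: str,
    retrievers: Dict[str, Retriever],
    rerank: Reranker,
    top_k: int,
) -> List[Candidate]:
    """Keep, for every chunk, the configuration with the best reranked score."""
    best: Dict[int, Candidate] = {}
    for config, retrieve in retrievers.items():
        for cand in retrieve(query):
            rescored = Candidate(cand.chunk_id, config, rerank(query, cand))
            current = best.get(cand.chunk_id)
            if current is None or rescored.score > current.score:
                best[cand.chunk_id] = rescored
    # Keep the top_k chunks overall, then restore temporal order.
    winners = sorted(best.values(), key=lambda c: c.score, reverse=True)[:top_k]
    return sorted(winners, key=lambda c: c.chunk_id)


# Toy retrievers with made-up scores, one per configuration.
toy_retrievers: Dict[str, Retriever] = {
    "caption@30s": lambda q: [
        Candidate(0, "caption@30s", 0.9),
        Candidate(1, "caption@30s", 0.4),
    ],
    "frames@10s": lambda q: [
        Candidate(1, "frames@10s", 0.8),
        Candidate(2, "frames@10s", 0.6),
    ],
}

# Identity reranker: reuse the retriever score (an assumption for the demo).
evidence = chunk_adaptive_select(
    "where did I leave my keys?", toy_retrievers, lambda q, c: c.score, top_k=2
)
for cand in evidence:
    print(cand.chunk_id, cand.config, cand.score)
```

In this toy run, chunk 0 is kept under the caption configuration and chunk 1 under the frame configuration, so the selected evidence interleaves two configurations. A query-level method that commits to one configuration for the whole query could not produce that mix.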

