2606.09064v1 Jun 08, 2026 cs.CV

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

Chen Jia

Citations: 69

h-index: 4

Yumeng Zhang

Citations: 104

h-index: 3

Naiming Liu

Citations: 309

h-index: 9

Yi Lu

University of Toronto

Citations: 194

h-index: 7

Bowen Liu

Citations: 9

h-index: 2

Shuning Wang

Citations: 1

h-index: 1

Zhiheng Wu

Citations: 50

h-index: 4

Shuo Nie

Citations: 4

h-index: 2

Weijie Zhu

Citations: 0

h-index: 0

Recent advances in Video Large Language Models (Video-LLMs) have improved performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to "See More" by dynamically gathering query-expanded visual evidence, and to "Think Deeper" by verifying draft answers against answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric, visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models of the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

