2606.09064v1 Jun 08, 2026 cs.CV

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

Chen Jia
Yumeng Zhang
Naiming Liu
Yi Lu
University of Toronto
Bowen Liu
Shuning Wang
Zhiheng Wu
Shuo Nie
Weijie Zhu

Recent advances in Video Large Language Models (Video-LLMs) have improved performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and to Think Deeper by verifying draft answers against answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric, visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models of the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.
