2604.01002v1 Apr 01, 2026 cs.CV

Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

Yueqian Lin

Citations: 440

h-index: 10

Yiran Chen

Citations: 141

h-index: 6

Yiheng Wang

Citations: 24

h-index: 3

Lichen Zhu

Citations: 0

h-index: 0

Yudong Liu

Citations: 59

h-index: 5

Jingyang Zhang

Citations: 189

h-index: 7

Hai Helen Li

Citations: 0

h-index: 0

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
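The abstract's step from subset selection to independent frame-level scoring can be illustrated with a minimal sketch. It assumes that the set objective decomposes into a sum of per-frame evidence scores, in which case picking the top-scoring frames under a frame budget maximizes that sum. The function name `select_keyframes`, the `budget` parameter, and the example scores are hypothetical and not taken from the paper; in the paper's setting the scores would come from the query-conditioned evidence scoring network.

```python
from typing import List, Sequence


def select_keyframes(scores: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` highest-scoring frame indices, returned in temporal order.

    Assumption (not from the paper): the selection objective is a plain sum of
    per-frame scores, so top-k selection is optimal under a frame-count budget.
    """
    if budget <= 0:
        return []
    k = min(budget, len(scores))
    # sorted() is stable, so frames with equal scores keep their temporal order
    # and the earlier frame wins a tie.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore temporal order before handing the frames to the MLLM.
    return sorted(ranked[:k])


# Hypothetical evidence scores for six frames of one video and one query.
example_scores = [0.10, 0.90, 0.30, 0.80, 0.05, 0.70]
selected = select_keyframes(example_scores, budget=3)
print(selected)  # [1, 3, 5]
```

Because each frame is scored on its own, the cost is one scoring pass plus a sort, rather than a search over all frame subsets.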
