2605.07872v1 May 08, 2026 cs.CV

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

Fandong Meng

Citations: 7,483

h-index: 42

Yuancheng Wei

Citations: 158

h-index: 3

Haojie Zhang

Citations: 153

h-index: 3

Linli Yao

Citations: 119

h-index: 4

Lei Li

Peking University

Citations: 7,911

h-index: 31

Hao Zhou

Citations: 13

h-index: 2

Xu Sun

Citations: 99

h-index: 3

Original Abstract

Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.
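The abstract mentions three mechanisms without giving details: scoring preference pairs, majority voting, and best-of-$N$ test-time scaling. The following is a minimal sketch of how these are commonly computed, not the authors' implementation. The function names and the `toy_score` scorer are hypothetical stand-ins; a real setup would replace `toy_score` with a trained reward model such as VideoDRM, and how the paper applies majority voting is an assumption here.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent label among repeated judgments.

    Ties go to the label that appeared first (Counter keeps insertion order).
    """
    return Counter(votes).most_common(1)[0][0]


def pairwise_accuracy(
    pairs: Sequence[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (chosen, rejected) pairs where chosen gets the higher score."""
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: keep the candidate the reward model scores highest."""
    return max(candidates, key=score)


def toy_score(response: str) -> float:
    """Hypothetical scorer for illustration only: longer responses score higher."""
    return float(len(response))


if __name__ == "__main__":
    # Three sampled judgments on one preference pair, aggregated by majority.
    print(majority_vote(["A", "B", "A"]))
    # Benchmark-style accuracy over two toy preference pairs.
    print(pairwise_accuracy([("abc", "a"), ("a", "ab")], toy_score))
    # Pick one answer out of N = 3 sampled candidates.
    print(best_of_n(["a", "abc", "ab"], toy_score))
```

In this sketch a discriminative reward model corresponds to the `score` callable, which returns a scalar per response. A generative reward model would instead emit a textual judgment, which is where repeated sampling and `majority_vote` would apply.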

0 Citations

0 Influential

21 Altmetric

105.0 Score
