2604.10966v1 Apr 13, 2026 cs.CV

A Multi-Response Reward Model That Scores Every Response in a Single Inference Pass

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

Ranjay Krishna

Citations: 853

h-index: 18

Jieyu Zhang

Citations: 59

h-index: 2

Zixian Ma

Citations: 772

h-index: 13

Yinuo Yang

Citations: 65

h-index: 3

Manasi Ganti

Citations: 9

h-index: 1

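This paper presents a discriminative multimodal reward model that scores every candidate response in a single forward pass. Conventional discriminative reward models evaluate each response independently and therefore need one forward pass per candidate. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, which enables direct comparative reasoning and supports efficient N-way preference learning. The multi-response design also yields up to an N-fold reduction in wall-clock time and FLOPs relative to conventional single-response scoring. To enable N-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR²Bench-Image contains human-annotated rankings over responses generated by 8 diverse models; (2) MR²Bench-Video is a large-scale video-based reward benchmark built from 94,000 crowdsourced pairwise human judgments on video question answering across 19 models, denoised via preference graph ensemble. Both benchmarks provide evaluation variants with 4 responses sampled from the full rankings. Built on a 4-billion-parameter vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks: MR²Bench-Image, MR²Bench-Video, and four existing benchmarks. It outperforms larger existing generative and discriminative reward models. When used for reinforcement learning with GRPO, our reward model also yields policy models that far exceed a single-response discriminative reward model (RM) baseline in both training stability and generation quality, maintaining performance on standard multimodal benchmarks while substantially improving open-ended generation quality.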

Original Abstract

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
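The training objective described in the abstract, cross-entropy over the scalar scores of N concatenated responses, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The `<sep>` separator string, the helper names, the example scores, and the choice of a single preferred response as the target are all assumptions made for the example; the abstract does not specify where scores are read out or how ranking labels are turned into targets.

```python
import math


def build_multi_response_input(prompt, responses, sep="<sep>"):
    """Join a prompt and N candidate responses into one sequence.

    `sep` is a hypothetical separator token; the real token is not given
    in the abstract.
    """
    return prompt + "".join(f"{sep}{r}" for r in responses)


def nway_cross_entropy(scores, preferred_index):
    """Cross-entropy of a softmax over N scalar scores.

    Assumes the target is the index of the single preferred response.
    """
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    # Negative log-probability assigned to the preferred response.
    return log_norm - scores[preferred_index]


# Hypothetical scalar scores, one per candidate, standing in for the
# outputs of a value head after a single forward pass.
text = build_multi_response_input("Q: describe the image.", ["a", "b", "c", "d"])
scores = [2.0, 0.5, -1.0, 0.0]
loss = nway_cross_entropy(scores, preferred_index=0)
```

Because all N scores come from one concatenated sequence, a single forward pass replaces the N passes that independent per-response scoring would need, which is where the abstract's "up to $N\times$" speedup comes from.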

0 Citations

0 Influential

9 Altmetric

45.0 Score


