2602.07533v1 Feb 07, 2026 cs.AI

Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Authors (citations, h-index):
Yankai Yang (0, 0)
Yancheng Long (1, 1)
Wei Chen (121, 5)
Tianke Zhang (211, 8)
Kaiyu Jiang (160, 5)
Haonan Fan (81, 2)
Changyi Liu (194, 7)
Jiankang Chen (204, 6)
Kaiyu Tang (138, 4)
Bin Wen (311, 9)
Fan Yang (171, 7)
Tingting Gao (307, 9)
Han Li (3, 1)
Shuo Yang (0, 0)
Hongyang Wei (99, 5)

Abstract

Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation. JRM achieves state-of-the-art results on MMRB2 and EditReward-Bench, and significantly improves stability and performance in downstream online reinforcement learning. These results show that joint training effectively bridges efficiency and semantic understanding in reward modeling.
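
The abstract describes jointly optimizing preference learning and language modeling on a shared backbone but does not give the objective. The following is a minimal sketch of what such a combined loss could look like, not the authors' formulation. It assumes a Bradley-Terry pairwise preference term and a mean token-level negative log-likelihood term, summed with a hypothetical weight `lm_weight`. All function and parameter names are illustrative.

```python
import math
from typing import Sequence


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss, -log(sigmoid(s_preferred - s_rejected)).

    Assumption: the abstract does not name the preference objective.
    """
    margin = score_preferred - score_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def language_modeling_loss(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood over the target reasoning tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(
    score_preferred: float,
    score_rejected: float,
    token_log_probs: Sequence[float],
    lm_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Sum of the two terms; both would backpropagate into the shared backbone."""
    return preference_loss(score_preferred, score_rejected) + (
        lm_weight * language_modeling_loss(token_log_probs)
    )


if __name__ == "__main__":
    # Scalar scores from a reward head, log-probs from a language-model head.
    print(joint_loss(1.2, 0.3, [-0.5, -1.0, -0.2], lm_weight=0.5))
```

Under this reading, only the scalar score would be needed at inference time, so no reasoning text has to be decoded. That is one plausible way to interpret the abstract's claim of fast evaluation.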
