2604.19544v1 Apr 21, 2026 cs.AI

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

Zhihong Zhang

Jie Zhao

Jin Xu

Zhuo Luo

Xin Liu

Jiansheng Wei

Xuejin Chen

Xiaojiang Huang

Abstract

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
