2605.06036v1 May 07, 2026 cs.LG

Optimal Transport for LLM Reward Modeling from Noisy Preference

Shijian Wang (Citations: 89; h-index: 4)

Licheng Pan (Citations: 244; h-index: 7)

Zhixuan Chu (Citations: 38; h-index: 3)

Yuan Lu (Citations: 27; h-index: 1)

Haoxuan Li (Citations: 65; h-index: 5)

Lei Shen (Citations: 19; h-index: 2)

Yinuo Wang (Citations: 2; h-index: 1)

Hao Yang (Citations: 12; h-index: 2)

Yu-An Lu, National Chupei High School (Citations: 30; h-index: 2)

Yongqi Tong (Citations: 400; h-index: 6)

Hao Wang (Citations: 15; h-index: 2)

Original Abstract

Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preferences. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To address these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with the preference data. Furthermore, because strict mass conservation compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples whose noisy preferences contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
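
The mass-relaxation idea can be illustrated with a toy partial optimal transport problem. The sketch below is not the authors' implementation: the abstract does not define the Joint Consistency Discrepancy, so a squared distance between hypothetical 1-D scores stands in for the cost, and the solver is a generic entropic Sinkhorn iteration applied to the standard dummy-point extension of partial transport. The function name `partial_sinkhorn` and the parameters `mass`, `eps`, and `n_iter` are assumptions made for illustration.

```python
import numpy as np


def partial_sinkhorn(a, b, C, mass, eps=0.05, n_iter=5000):
    """Entropic partial OT via a dummy-point extension (illustrative sketch).

    Only `mass` units are moved between the real points; the remainder is
    routed to an extra dummy point on each side at zero cost.
    """
    C = C / C.max()                       # normalise real costs to [0, 1]
    n, m = C.shape
    dummy_cost = 2.0                      # dummy-to-dummy cost, above any real cost

    # Extended cost matrix: real <-> dummy is free, dummy <-> dummy is expensive.
    C_ext = np.zeros((n + 1, m + 1))
    C_ext[:n, :m] = C
    C_ext[n, m] = dummy_cost

    # The dummy points carry exactly the mass that may stay untransported.
    a_ext = np.append(a, b.sum() - mass)
    b_ext = np.append(b, a.sum() - mass)

    # Plain Sinkhorn scaling on the extended (balanced) problem.
    K = np.exp(-C_ext / eps)
    u = np.ones(n + 1)
    v = np.ones(m + 1)
    for _ in range(n_iter):
        u = a_ext / (K @ v)
        v = b_ext / (K.T @ u)

    plan_ext = u[:, None] * K * v[None, :]
    return plan_ext[:n, :m]               # drop the dummy row and column


# Hypothetical toy data: four consistent samples and one outlier (last entry).
scores = np.array([0.0, 0.1, 0.2, 0.3, 5.0])
targets = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
a = np.full(5, 0.2)                       # uniform mass on the samples
b = np.full(5, 0.2)                       # uniform mass on the targets
C = (scores[:, None] - targets[None, :]) ** 2

plan = partial_sinkhorn(a, b, C, mass=0.8)
retained = plan.sum(axis=1)               # mass each sample keeps in the plan
print(np.round(retained, 4))
```

In this toy setting the sample that is far from every target keeps almost none of its mass, while the consistent samples keep nearly all of theirs. That is the behaviour the abstract attributes to Mass Relaxation; how SelectiveRM chooses the transported mass and defines the cost is not specified in the abstract.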

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
