2603.23184v1 Mar 24, 2026 cs.CL

ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment

Xiaoxi Li

Citations: 17

h-index: 2

Licheng Pan

Citations: 206

h-index: 7

Zhichao Chen

Citations: 414

h-index: 9

Yuan Lu

Citations: 25

h-index: 1

Zhouchen Lin

Citations: 129

h-index: 6

Haoxuan Li

Citations: 49

h-index: 4

Hao Wang

Citations: 92

h-index: 4

Haochen Yang

Citations: 26

h-index: 3

Lei Shen

Citations: 19

h-index: 2

Yinuo Wang

Citations: 2

h-index: 1

Original Abstract

Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we study implicit reward modeling, that is, learning reward models from implicit human feedback (e.g., clicks and copies), as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.
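The two challenges named in the abstract can be made concrete with a small simulation. The sketch below is not ImplicitRM's objective, which the abstract does not spell out. It uses plain inverse propensity weighting with an oracle propensity, a standard debiasing technique, only to show why unclicked responses are not reliable negatives. Everything in it is an assumption for illustration: the `long_response` feature, the preference and action probabilities, and the function names `simulate` and `rates`.

```python
import random


def simulate(n=200_000, seed=0):
    """Simulate implicit feedback logs.

    A click is logged only when the user both prefers the response and
    is inclined to act on it, so a missing click is ambiguous.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Hypothetical response feature that drives both latent variables.
        long_response = rng.random() < 0.5
        # Assumed true preference rate per response type.
        p_prefer = 0.7 if long_response else 0.4
        # Assumed propensity to click or copy per response type.
        p_act = 0.2 if long_response else 0.6
        preferred = rng.random() < p_prefer
        acts = rng.random() < p_act
        clicked = preferred and acts
        rows.append((long_response, p_act, preferred, clicked))
    return rows


def rates(rows, long_response):
    """Return (true preference rate, naive click rate, IPS estimate)."""
    group = [r for r in rows if r[0] == long_response]
    n = len(group)
    true_rate = sum(r[2] for r in group) / n
    # Naive view: every unclicked sample counts as a negative.
    naive_rate = sum(r[3] for r in group) / n
    # Inverse propensity weighting with the (oracle) action propensity.
    ips_rate = sum(r[3] / r[1] for r in group) / n
    return true_rate, naive_rate, ips_rate


if __name__ == "__main__":
    rows = simulate()
    for flag in (True, False):
        true_rate, naive_rate, ips_rate = rates(rows, flag)
        print(f"long_response={flag}: true={true_rate:.3f} "
              f"naive={naive_rate:.3f} ips={ips_rate:.3f}")
```

Under these assumed numbers, long responses are preferred more often but clicked less often, so the naive click rate ranks the two response types in the wrong order, while the reweighted estimate tracks the true preference rate. In real logs the propensity is unknown and must be estimated, which is the kind of gap a latent stratification model is meant to address.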
