2603.18736v1 Mar 19, 2026 cs.LG

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Xiaoxi Li (Citations: 17, h-index: 2)

Hao Wang (Citations: 82, h-index: 5)

Licheng Pan (Citations: 206, h-index: 7)

Zhichao Chen (Citations: 414, h-index: 9)

Chunyuan Zheng (Citations: 609, h-index: 14)

Zhixuan Chu (Citations: 27, h-index: 2)

Yuan Lu (Citations: 25, h-index: 1)

Xinggao Liu (Citations: 397, h-index: 12)

Zhouchen Lin (Citations: 129, h-index: 6)

Haoxuan Li (Citations: 49, h-index: 4)

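Although reinforcement learning from human feedback (RLHF) has helped improve language models, current reward modeling relies heavily on experimental feedback data collected from human annotators in controlled, costly settings. In this work, we propose observational reward modeling, which trains reward models on observational user feedback such as clicks, copies, and likes, as a scalable and cost-effective alternative. We identify two key challenges in this setting: (1) observational feedback is noisy because of annotation errors and may not match true user preferences; (2) observational feedback is biased by user preference, because users give feedback more readily on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. For challenge (1), CausalRM introduces a noise-aware surrogate loss term that explicitly models the annotation error generation process and is provably equivalent to the primal loss under noise-free conditions. For challenge (2), CausalRM reweights training samples by propensity scores, the probability that a user provides feedback for a given response, yielding a loss function that removes user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets show that CausalRM effectively learns accurate reward signals from noisy, biased observational feedback and delivers substantial gains on downstream RLHF tasks (for example, a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench). The code is available on the project website.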

Original Abstract

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling relies heavily on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which cause it to deviate from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
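
The abstract names two ingredients, a noise-aware surrogate loss and propensity-score reweighting, but does not give their formulas. The Python sketch below is therefore only an illustration built from two standard techniques that may differ from the paper's actual estimator: the textbook unbiased loss correction for class-conditional label noise, and inverse propensity scoring. Every name here (`bce`, `noise_corrected_loss`, `ips_noise_aware_loss`, `rho_pos`, `rho_neg`, `min_propensity`) is hypothetical, and the noise rates and propensities are assumed to be known, whereas in practice they would have to be estimated.

```python
import math


def bce(score, label):
    """Binary cross-entropy of a raw reward score against a 0/1 label."""
    p = 1.0 / (1.0 + math.exp(-score))
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def noise_corrected_loss(score, observed_label, rho_pos, rho_neg):
    """Standard unbiased correction for class-conditional label noise.

    Assumed noise model (not taken from the paper):
      rho_pos = P(observed label 0 | true label 1)
      rho_neg = P(observed label 1 | true label 0)
    With rho_pos = rho_neg = 0 this reduces to the plain loss.
    """
    denom = 1.0 - rho_pos - rho_neg
    if denom <= 0.0:
        raise ValueError("noise rates must satisfy rho_pos + rho_neg < 1")
    if observed_label == 1:
        num = (1.0 - rho_neg) * bce(score, 1) - rho_pos * bce(score, 0)
    else:
        num = (1.0 - rho_pos) * bce(score, 0) - rho_neg * bce(score, 1)
    return num / denom


def ips_noise_aware_loss(scores, observed_labels, propensities,
                         rho_pos, rho_neg, min_propensity=0.05):
    """Inverse-propensity-weighted average of the noise-corrected loss.

    observed_labels[i] is None when the user gave no feedback on response i.
    propensities[i] is the assumed probability that feedback is observed.
    Propensities are floored at min_propensity to limit variance, which
    trades a small bias for stability.
    """
    total = 0.0
    for score, label, prop in zip(scores, observed_labels, propensities):
        if label is None:
            continue  # unobserved responses contribute no loss term
        weight = 1.0 / max(prop, min_propensity)
        total += weight * noise_corrected_loss(score, label, rho_pos, rho_neg)
    # Normalize by all responses, observed or not, as in standard IPS.
    return total / len(scores)
```

Dividing by the full number of responses rather than the number of observed ones is what makes the weighted sum an estimate of the loss over the whole response distribution. That is the sense in which reweighting counteracts the shift between responses that attract feedback and responses seen at inference time.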
