2604.02986v1 Apr 03, 2026 cs.LG

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Takashi Ishida (Citations: 671, h-index: 8)

Masashi Sugiyama (Citations: 23, h-index: 3)

Shinnosuke Ono (Citations: 11, h-index: 2)

Johannes Ackermann (Citations: 18, h-index: 3)

Soichiro Nishimori (Citations: 89, h-index: 4)

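Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes the learned proxy reward, true quality can stagnate or degrade. This work hypothesizes that reward hacking is often caused by flipped advantage signs: instead of lowering the likelihood of a bad response, an update driven by a flipped sign raises it. Considering adversarial perturbations in the RM parameter space, the authors derive a sign-preservation radius, the smallest perturbation that can flip the advantage sign during policy optimization. Building on this formulation, they propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches, SignCert-PO needs neither multiple RMs nor the RM training data; it is lightweight and works at the policy optimization stage using only the RM parameters and on-policy completions. On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently outperforms the baselines and reduces reward hacking.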

Original Abstract

Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius: the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
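The abstract does not give the formula for the sign-preservation radius or the exact weighting rule, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes a mean-baseline advantage over a group of completions, a first-order linearization of the reward in the RM parameters (so the nearest sign flip in L2 norm lies at distance |A_i| / ||g_i||, where g_i is the centered reward gradient), and a hypothetical linear down-weighting with a threshold `tau`. All function names and the `tau` parameter are illustrative.

```python
import math


def group_advantages(rewards):
    """Mean-baseline advantages: A_i = r_i - mean(r). (Assumed baseline.)"""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]


def sign_radii(rewards, reward_grads):
    """First-order estimate of the smallest L2 perturbation of the RM
    parameters that flips the sign of each advantage.

    reward_grads[i] is the gradient of r_i with respect to the RM
    parameters, flattened to a list of floats. Linearizing,
    A_i(theta + d) ~ A_i + g_i . d with g_i = grad r_i - mean_j grad r_j,
    so the closest sign flip lies at distance |A_i| / ||g_i||.
    """
    n = len(rewards)
    dim = len(reward_grads[0])
    mean_grad = [sum(g[k] for g in reward_grads) / n for k in range(dim)]
    radii = []
    for adv, grad in zip(group_advantages(rewards), reward_grads):
        centered = [gk - mk for gk, mk in zip(grad, mean_grad)]
        norm = math.sqrt(sum(c * c for c in centered))
        # A zero gradient means no first-order perturbation can flip the sign.
        radii.append(abs(adv) / norm if norm > 0 else math.inf)
    return radii


def robust_weights(radii, tau):
    """Hypothetical rule: full weight when radius >= tau, linear below it."""
    return [min(1.0, r / tau) for r in radii]


def weighted_advantages(rewards, reward_grads, tau=1.0):
    """Advantages scaled by their robustness weights, ready to plug into
    a policy gradient loss in place of the raw advantages."""
    adv = group_advantages(rewards)
    weights = robust_weights(sign_radii(rewards, reward_grads), tau)
    return [w * a for w, a in zip(weights, adv)]
```

For example, with rewards `[1.0, 1.0, -1.0, -1.0]` and one-dimensional reward gradients `[[0.5], [4.0], [-0.5], [-4.0]]`, the second and fourth completions have the same advantage magnitude as the first and third but a much larger gradient norm, so their estimated radius is 0.25 instead of 2.0 and, with `tau=1.0`, their advantages are scaled to a quarter of their original size. Computing per-completion gradients for a real RM would be far more expensive than this toy setup suggests; how the paper makes that step lightweight is not stated in the abstract.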

Citations: 1 | Influential citations: 0 | Altmetric: 4 | Score: 21.0
