2602.18037v1 Feb 20, 2026 cs.LG

Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Jan Ackermann

Citations: 68

h-index: 4

Michael Noukhovitch

Mila

Citations: 387

h-index: 7

Takashi Ishida

Citations: 671

h-index: 8

Masashi Sugiyama

Citations: 23

h-index: 3

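Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from Verifiable Rewards (RLVR) are two key stages in the post-training of modern language models (LMs). A common problem is reward hacking, in which the policy exploits inaccuracies in the reward and learns unintended behavior. Most prior work addresses this by constraining policy updates with a Kullback-Leibler (KL) penalty toward a reference model. We propose a different approach: training the LM so that policy updates are biased toward regions where the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of the optimum at convergence. Gradient regularization (GR) can then be used to steer training toward flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We improve on this further by proposing explicit GR with an efficient finite-difference estimate. Empirically, GR outperforms a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win rate in RLHF, avoids an excessive focus on format under rule-based math rewards, and prevents judge hacking in LLM-as-a-Judge math tasks.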

Original Abstract

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: Train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
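
The abstract mentions explicit GR with a finite-difference estimate but gives no formula. As a rough illustration only, the sketch below uses one common form of gradient regularization, the penalized objective L(theta) + (lam / 2) * ||grad L(theta)||^2, whose gradient is grad L + lam * H * grad L. The Hessian-vector product H * grad L is approximated by a forward difference of two gradient evaluations. The toy quadratic loss and the names CURVATURES, grad_loss, gr_gradient, lam and eps are all assumptions made for this example; the paper's actual objective, estimator and hyperparameters may differ.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical toy loss L(theta) = 0.5 * sum(a_i * theta_i**2).
# A large a_i means a sharp direction, a small a_i a flat one.
CURVATURES: Vector = [1.0, 10.0]


def grad_loss(theta: Vector) -> Vector:
    """Analytic gradient of the toy loss: a_i * theta_i."""
    return [a * t for a, t in zip(CURVATURES, theta)]


def gr_gradient(
    grad_fn: Callable[[Vector], Vector],
    theta: Vector,
    lam: float = 0.1,
    eps: float = 1e-3,
) -> Vector:
    """Gradient of L + (lam / 2) * ||grad L||^2 via a finite difference.

    The penalty's gradient is lam * H @ g with g = grad L(theta).
    H @ g is approximated as (grad L(theta + eps * g) - g) / eps,
    so no explicit Hessian is ever formed.
    """
    g = grad_fn(theta)
    shifted = [t + eps * gi for t, gi in zip(theta, g)]
    g_shifted = grad_fn(shifted)
    hvp = [(gs - gi) / eps for gs, gi in zip(g_shifted, g)]
    return [gi + lam * h for gi, h in zip(g, hvp)]


if __name__ == "__main__":
    theta = [1.0, 1.0]
    print("plain gradient:      ", grad_loss(theta))
    print("regularized gradient:", gr_gradient(grad_loss, theta))
```

In this sketch the regularized update costs one extra gradient evaluation per step. The penalty term scales with the square of the curvature, so it pushes hardest along sharp directions, which is the sense in which such a penalty favors flatter regions.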

4 Citations

0 Influential

4 Altmetric

24.0 Score
