2602.01750v1 Feb 02, 2026 cs.AI

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

M. Beigi (Citations: 66, h-index: 4)

Qifan Wang (Citations: 280, h-index: 9)

Ming Jin (Citations: 38, h-index: 3)

Lifu Huang (Citations: 4, h-index: 1)

Junshan Zhang (Citations: 20, h-index: 2)

Original Abstract

Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains: a Hacker trained on code gaming exhibits increased sycophancy despite no reward for this behavior, and an Auditor trained on one domain effectively suppresses exploitation in others, enabling efficient multi-domain defense with a single model.
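
The second stage described in the abstract, in which an Auditor's detection signal gates the reward, can be illustrated with a small sketch. The abstract does not state the gating rule, so everything here is an assumption for illustration: the function name gate_rewards, the penalty parameter, and the soft-gating formula (1 - p) * r - penalty * p, where p is taken to be the Auditor's estimated probability that a response is hacking. It is not the paper's AG-RLHF objective.

```python
from typing import List, Sequence


def gate_rewards(
    rewards: Sequence[float],
    hack_probs: Sequence[float],
    penalty: float = 1.0,
) -> List[float]:
    """Soft-gate reward-model scores with an auditor's hacking probability.

    Hypothetical rule (not taken from the paper):
        gated = (1 - p) * r - penalty * p
    so a response the auditor considers clean (p = 0) keeps its reward,
    and a response flagged with certainty (p = 1) receives only -penalty.
    """
    if len(rewards) != len(hack_probs):
        raise ValueError("rewards and hack_probs must have the same length")

    gated = []
    for r, p in zip(rewards, hack_probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError("hack probabilities must lie in [0, 1]")
        # Scale the original reward down by the detection probability,
        # then subtract a penalty proportional to that probability.
        gated.append((1.0 - p) * r - penalty * p)
    return gated


if __name__ == "__main__":
    # Three responses with the same raw reward but different auditor scores.
    print(gate_rewards([2.0, 2.0, 2.0], [0.0, 1.0, 0.5]))
```

Under this assumed rule, the gated values would stand in for the raw reward-model scores in the policy update, so a high score obtained through detected exploitation no longer pays off. A hard threshold on p would be an equally plausible reading of "gates"; the soft form is used here only because it keeps the signal continuous.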

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
