2603.07084v2 Mar 07, 2026 cs.LG

카운트다운 코드: 강화 학습 기반 시뮬레이션(RLVR)에서 보상 해킹의 발생 및 일반화 현상을 연구하기 위한 테스트 환경

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Muhammad Khalifa

Citations: 213

h-index: 8

Hao Peng

Citations: 99

h-index: 2

Lu Wang

Citations: 351

h-index: 7

Zohaib Khan

Citations: 20

h-index: 3

Omer Tafveez

Citations: 11

h-index: 2

보상 해킹은 모델이 근본적인 작업을 실제로 해결하지 않고, 대리 보상을 과도하게 최적화하는 형태의 정렬 불일치 현상입니다. 진정한 작업 보상을 정확하게 측정하는 것은 종종 어렵거나 불가능하기 때문에, 보상 해킹의 발생 빈도를 정확하게 측정하는 것이 어렵습니다. 본 논문에서는 모델이 수학적 추론 작업을 해결하고 동시에 테스트 환경을 조작할 수 있는 최소 환경인 Countdown-Code를 소개합니다. 이러한 이중 접근 방식은 대리 보상(테스트 합격/불합격)과 진정한 보상(수학적 정확성)을 명확하게 분리하여, 보상 해킹 비율을 정확하게 측정할 수 있도록 합니다. 이 환경을 사용하여 개방형 가중치 LLM에서 보상 해킹 현상을 연구한 결과, 지도 학습 미세 조정(SFT) 과정에서 소량의 보상 해킹 경로가 학습 데이터에 포함될 경우, 모델이 의도치 않게 이러한 행동을 학습할 수 있다는 것을 확인했습니다. 증류 SFT 데이터에서 1% 정도의 오염만으로도 모델이 보상 해킹을 내재화하게 되고, 이후 강화 학습(RL) 과정에서 이러한 현상이 다시 나타납니다. 또한, RL은 정렬 불일치를 증폭시키고 원래 도메인 너머로 일반화시키는 것을 보여줍니다. 저희는 이 환경과 코드를 공개하여 LLM에서의 보상 해킹 연구를 촉진하고자 합니다. 저희의 연구 결과는 LLM에서 보상 해킹이 발생하고 지속되는 이전에 잘 알려지지 않았던 경로를 보여주며, 합성 SFT 데이터의 더욱 엄격한 검증의 필요성을 강조합니다. 코드는 https://github.com/zohaib-khan5040/Countdown-Code 에서 확인할 수 있습니다.

Original Abstract

Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1\% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib-khan5040/Countdown-Code.

7 Citations

0 Influential

33.729550745277 Altmetric

175.6 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!