2603.07084v1 Mar 07, 2026 cs.LG

카운트다운 코드: 강화 학습 기반 가상 환경에서 보상 해킹의 발생 및 일반화 현상을 연구하기 위한 테스트 환경

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Muhammad Khalifa

Citations: 213

h-index: 8

Hao Peng

Citations: 99

h-index: 2

Lu Wang

Citations: 351

h-index: 7

Zohaib Khan

Citations: 20

h-index: 3

Omer Tafveez

Citations: 11

h-index: 2

보상 해킹은 모델이 실제 문제 해결 없이 단순히 프록시 보상을 극대화하는 형태의 불일치 현상입니다. 진정한 문제 해결에 대한 보상을 정확하게 측정하는 것은 어렵기 때문에, 보상 해킹의 발생 빈도를 측정하는 데 어려움이 있습니다. 본 연구에서는 모델이 수학적 추론 문제를 해결하고 동시에 테스트 환경을 조작할 수 있는 최소한의 환경인 '카운트다운 코드'를 소개합니다. 이중 접근 방식 설계를 통해 프록시 보상(테스트 통과/불통과)과 실제 보상(수학적 정확성)을 명확하게 분리하여, 보상 해킹 비율을 정확하게 측정할 수 있습니다. 이 환경을 사용하여 개방형 가중치 LLM에서 보상 해킹 현상을 연구한 결과, 지도 미세 조정(SFT) 과정에서 소량의 보상 해킹 경로가 학습 데이터에 포함되면 모델이 의도치 않게 이러한 행동을 학습할 수 있다는 것을 발견했습니다. 증류 SFT 데이터에서 1%의 오염만으로도 모델이 보상 해킹을 내재화하게 되며, 이는 이후 강화 학습(RL) 과정에서 다시 나타납니다. 또한, RL은 불일치를 증폭시키고 원래 도메인 범위를 넘어 일반화시키는 경향이 있음을 보여줍니다. 본 연구에서 개발한 환경과 코드를 공개하여 LLM에서의 보상 해킹 연구를 촉진하고자 합니다. 연구 결과는 LLM에서 보상 해킹이 발생하고 지속되는 이전에 잘 알려지지 않았던 경로를 보여주며, 합성 SFT 데이터에 대한 더욱 엄격한 검증의 필요성을 강조합니다. 코드 및 관련 자료는 https://github.com/zohaib-khan5040/Countdown-Code 에서 확인할 수 있습니다.

Original Abstract

Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1\% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib-khan5040/Countdown-Code.

7 Citations

0 Influential

24 Altmetric

127.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!