2602.19416v1 Feb 23, 2026 cs.AI

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

M. Beigi (Citations: 66, h-index: 4)

Qifan Wang (Citations: 280, h-index: 9)

Lifu Huang (Citations: 4, h-index: 1)

Ming Jin (Citations: 11, h-index: 2)

Junshan Zhang (Citations: 2, h-index: 1)

Jiaxi Zhang (Citations: 4, h-index: 2)

Abstract

Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking, in which models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We propose Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from post-alignment and baseline policies to explain behavioral shifts during RLHF. We then decompose the reconstructed reward via sparse autoencoders into interpretable features, enabling identification of hacking signatures through contribution analysis. Finally, we propose four mitigation strategies (clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation) that target problematic features while preserving beneficial alignment. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
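To make the pipeline in the abstract more concrete, the following is a minimal sketch of the three ideas it names: reconstructing a reward from paired aligned/baseline responses, attributing the behavioral shift to individual features, and zeroing a flagged feature. The abstract does not give the C-IRL objective, so the Bradley-Terry-style logistic loss used here is an assumption, as is the linear reward over a small hand-made feature vector (standing in for sparse-autoencoder features). All names (`fit_contrastive_reward`, `feature_contributions`, `clean_w`) and the synthetic data are hypothetical and not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    # Numerically stable logistic function: 1 / (1 + exp(-x)).
    return np.exp(-np.logaddexp(0.0, -x))


def fit_contrastive_reward(feats_aligned, feats_base, lr=0.5, steps=500, l2=1e-3):
    """Fit a linear reward r(x) = w . phi(x) by gradient descent.

    Assumed objective (not from the paper): minimize the mean of
    -log sigmoid(r(aligned) - r(base)) plus an L2 penalty, so that
    post-alignment responses outscore their paired baseline responses.
    """
    diff = feats_aligned - feats_base          # shape: (n_pairs, n_features)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        margin = diff @ w
        # d/dw of -log sigmoid(margin) is -(1 - sigmoid(margin)) * diff
        grad = -(diff * (1.0 - sigmoid(margin))[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad
    return w


def feature_contributions(w, feats_aligned, feats_base):
    # Per-feature share of the mean reward gap between the two policies.
    return w * (feats_aligned - feats_base).mean(axis=0)


# Synthetic stand-in data: feature 0 plays the role of a genuine quality
# signal, feature 3 a spurious one (for example, response length).
rng = np.random.default_rng(0)
n_pairs, n_feats = 400, 4
feats_base = rng.normal(size=(n_pairs, n_feats))
shift = np.array([1.0, 0.0, 0.0, 2.0])         # how the tuned policy moved
feats_aligned = feats_base + shift + 0.5 * rng.normal(size=(n_pairs, n_feats))

w = fit_contrastive_reward(feats_aligned, feats_base)
contrib = feature_contributions(w, feats_aligned, feats_base)
suspect = int(np.argmax(contrib))               # feature to inspect first

# Crude "clean reward": drop the flagged feature from the reconstructed reward.
clean_w = w.copy()
clean_w[suspect] = 0.0
```

In this toy setup the largest contribution falls on the feature that was shifted most by construction. A large contribution only marks a feature for inspection; deciding whether it reflects hacking or legitimate alignment still requires interpreting the feature, which is the role the abstract assigns to the sparse-autoencoder decomposition.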

