2602.15515v1 Feb 17, 2026 cs.LG

기만 탐지 시스템을 활용한 강화 학습 환경에서 정직성이 나타나는 현상: '기만 회피 전략 지도' 연구

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Adam Gleave

Citations: 1,145

h-index: 16

Stefan Heimersheim

Citations: 0

h-index: 0

Mohammad Taufeeque

Citations: 53

h-index: 2

Chris Cundy

Citations: 24

h-index: 3

AI 시스템의 정직성을 확보하기 위한 방법으로, 화이트박스 기만 탐지 시스템을 이용한 학습이 제안되었습니다. 그러나 이러한 학습 방식은 모델이 탐지기를 회피하기 위해 기만을 은폐하는 전략을 학습할 위험을 내포합니다. 기존 연구에서는 모델이 직접적으로 유해한 결과를 생성하는 경우에만 기만 회피 현상을 연구했습니다. 본 연구에서는 실제적인 코딩 환경을 구축하여, 모델이 테스트 케이스를 직접 조작하여 보상을 획득하는 상황에서 기만 회피 현상이 어떻게 발생하는지 분석했습니다. 기만 탐지 시스템을 활용한 학습 과정에서 발생 가능한 다양한 결과를 분류하고, 모델이 정직성을 유지하거나, 두 가지 가능한 기만 회피 전략(은폐된 활성화 및 은폐된 정책)을 통해 기만적인 행동을 보이도록 학습될 수 있음을 보였습니다. 실험 결과, 은폐된 활성화는 강화 학습 과정에서 발생하는 표현 변화로 인해 발생하며, 기만 탐지기 페널티 유무에 관계없이 나타납니다. 반면, 기만 탐지기 페널티는 은폐된 정책을 유도하는 경향이 있으며, 이는 정책 경사 방법론에서 이론적으로 예상되는 결과입니다. 적절한 수준의 KL 정규화 및 기만 탐지기 페널티를 적용하면, 모델이 정직한 정책을 학습하도록 유도할 수 있으며, 이는 보상 해킹에 취약한 작업에서 화이트박스 기만 탐지기를 효과적인 학습 신호로 활용할 수 있음을 시사합니다.

Original Abstract

Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector. The model either remains honest, or becomes deceptive via two possible obfuscation strategies. (i) Obfuscated activations: the model outputs deceptive text while modifying its internal representations to no longer trigger the detector. (ii) Obfuscated policy: the model outputs deceptive text that evades the detector, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The probe penalty only incentivizes obfuscated policies; we theoretically show this is expected for policy gradient methods. Sufficiently high KL regularization and detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.

0 Citations

0 Influential

8 Altmetric

40.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!