2603.29993v1 Mar 31, 2026 cs.AI

Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation

Abstract

Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal (Farquhar et al., 2025). The original paper identifies a critical open question: how the method of constructing approval, particularly the degree to which approval depends on achieved outcomes, affects whether MONA's safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i) repackages the released codebase as a standard Python project with scripted PPO training, (ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii) introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced-budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned-overseer run achieves zero observed reward hacking but a substantially lower intended-behavior rate than oracle MONA (11.9% vs. 99.9%), which is consistent with under-optimization rather than re-emergent hacking. These results operationalize the MONA paper's approval-spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA's concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. Code, configurations, and reproduction commands are publicly available at https://github.com/codernate92/mona-camera-dropbox-repro
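
The contrast the abstract draws between ordinary RL and MONA can be illustrated with a minimal sketch of the per-step training targets. This is a simplification for intuition only and is not taken from the paper's repository: the function names `discounted_returns` and `mona_targets`, the single-step horizon, and the toy reward and approval values are all assumptions.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Ordinary RL target: each step is credited with all discounted future rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each entry sums its own future.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mona_targets(rewards: List[float], approvals: List[float]) -> List[float]:
    """MONA-style target (assumed single-step horizon): immediate reward plus the
    overseer's approval of that step, with no credit from later environment rewards."""
    if len(rewards) != len(approvals):
        raise ValueError("rewards and approvals must have the same length")
    return [r + a for r, a in zip(rewards, approvals)]


# Toy episode (hypothetical values): reward arrives only at the final step.
rewards = [0.0, 0.0, 1.0]
approvals = [0.5, 0.5, 0.0]

print(discounted_returns(rewards, gamma=1.0))
print(mona_targets(rewards, approvals))
```

Under the ordinary target, early steps inherit credit from whatever produces the final reward, including a multi-step exploit. Under the MONA-style target, early steps are shaped only by their own reward and the approval signal, which is why the quality of that approval signal becomes the deciding factor.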
