2602.03978v1 Feb 03, 2026 cs.AI

Monitorability, a Free Gift: How RLVR Naturally Aligns Reasoning

Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning

Zidi Xiong

Citations: 79

h-index: 4

Shan Chen

Citations: 37

h-index: 3

Hima Lakkaraju

Citations: 150

h-index: 6

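As large reasoning models (LRMs) are deployed more widely, auditing their reasoning traces (chain-of-thought, CoT) to ensure safety is becoming important. Recent work has reported that monitorability (how faithfully and informatively the CoT reflects internal computation) can appear as a "free gift" in the early stages of reinforcement learning with verifiable rewards (RLVR). This study makes the phenomenon concrete through a systematic evaluation across model families and training domains. The results show that the effect is not universal and that gains in monitorability depend heavily on the data. In particular, the study demonstrates the important role of data diversity and instruction-following data during RLVR training. It also shows that monitorability is independent of capability: improved reasoning performance does not imply increased transparency. Mechanistic analysis indicates that gains in monitorability stem mainly from sharpening of the response distribution (entropy reduction) and increased attention to the prompt, not from stronger direct causal reliance on the reasoning trace. The study further analyzes how monitorability changes under controlled training and evaluation difficulty. Overall, it provides a comprehensive picture of how monitorability arises under RLVR, clarifying when improvements can be expected and when they cannot.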

Original Abstract

As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chain-of-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability--the degree to which CoT faithfully and informatively reflects internal computation--can appear as a "free gift" during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability--improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute monitorability gains primarily to response distribution sharpening (entropy reduction) and increased attention to the prompt, rather than stronger causal reliance on reasoning traces. We also reveal how monitorability dynamics vary with controlled training and evaluation difficulty. Together, these findings provide a holistic view of how monitorability emerges under RLVR, clarifying when gains are likely to occur and when they are not.
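To make "response distribution sharpening (entropy reduction)" concrete, the following is a minimal sketch and not the paper's measurement procedure. It assumes a toy set of logits over four candidate responses and uses a lower softmax temperature as a stand-in for a sharper post-training distribution; the names `shannon_entropy`, `softmax`, and the logit values are hypothetical and chosen only for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over four candidate responses (illustrative only).
logits = [2.0, 1.0, 0.5, 0.0]

# Assumption: a lower temperature mimics a sharper response distribution.
before = softmax(logits, temperature=1.0)
after = softmax(logits, temperature=0.5)

entropy_before = shannon_entropy(before)
entropy_after = shannon_entropy(after)

print(f"entropy before sharpening: {entropy_before:.3f} nats")
print(f"entropy after sharpening:  {entropy_after:.3f} nats")
```

In this toy setting the sharper distribution concentrates probability mass on the top response, so its entropy is lower. Whether and how the authors estimate entropy over full model responses is not specified in the abstract.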
