2603.30036v1 Mar 31, 2026 cs.LG

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

Max Kaufmann (citations: 743, h-index: 3)

David Lindner (citations: 46, h-index: 3)

Roland S. Zimmermann (citations: 3,292, h-index: 4)

Rohin Shah (citations: 3,525, h-index: 8)

Abstract

Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model (the monitorability of the CoT) can be affected by training, for instance when the model learns to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with "in-conflict" terms will reduce monitorability, "orthogonal" terms will not affect it, and "aligned" terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing "in-conflict" reward terms is difficult.
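The reward decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the additive combination, the `Rollout` type, and both example terms (`answer_is_correct`, `penalize_flagged_word`) are assumptions made up for illustration, and the sketch does not attempt to classify the terms as aligned, orthogonal, or in-conflict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rollout:
    """One sampled episode: the reasoning trace and the final answer."""
    cot: str
    output: str


def total_reward(
    rollout: Rollout,
    output_term: Callable[[str], float],
    cot_term: Callable[[str], float],
) -> float:
    # Assumption: the two terms combine additively. The abstract only says
    # the reward "decomposes into two terms".
    return output_term(rollout.output) + cot_term(rollout.cot)


# Hypothetical reward terms, for illustration only.
def answer_is_correct(output: str) -> float:
    """Output-dependent term: 1.0 when the final answer matches a fixed target."""
    return 1.0 if output.strip() == "42" else 0.0


def penalize_flagged_word(cot: str) -> float:
    """CoT-dependent term: -1.0 when the reasoning mentions a flagged word."""
    return -1.0 if "shortcut" in cot.lower() else 0.0


flagged = Rollout(cot="I will take a shortcut here.", output="42")
clean = Rollout(cot="Compute 6 * 7.", output="42")

flagged_reward = total_reward(flagged, answer_is_correct, penalize_flagged_word)
clean_reward = total_reward(clean, answer_is_correct, penalize_flagged_word)
```

Keeping the two terms as separate callables mirrors the framing in the abstract, where each term is examined on its own before training begins.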
