2602.18297v1 Feb 20, 2026 cs.LG

Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Christos Louizos (Citations: 33; h-index: 3)

Usman Anwar (Citations: 1,547; h-index: 10)

Tim Bakker (Citations: 6; h-index: 2)

D. Kianfar (Citations: 21; h-index: 3)

Cristina Pinneri (Citations: 9; h-index: 2)

Abstract

Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: information gap, which measures the extent to which the monitor can extract the information available in CoT, and elicitation error, which measures the extent to which the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an oracle-based method that directly rewards the monitored model for producing CoTs that maximize monitor accuracy, and (b) a more practical, label-free approach that maximizes conditional mutual information between outputs and CoTs. Across multiple different environments, we show both methods significantly improve monitor accuracy while preventing CoT degeneration even when training against a monitor, thereby mitigating reward hacking when the task reward is imperfectly specified.
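The claim that non-zero mutual information is necessary but not sufficient can be illustrated with a toy calculation. The sketch below is not the paper's method or notation. It assumes a hypothetical discrete CoT feature `x` (for example, whether the reasoning mentions the tests) and a binary output attribute `y` (1 = test-hacking), uses a simple plug-in estimator on made-up counts, and compares two hypothetical monitors. The CoT carries information about the output in both cases, yet a monitor that fails to use it stays at chance accuracy.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def accuracy(monitor, pairs):
    """Fraction of samples where the monitor's prediction matches y."""
    return sum(monitor(x) == y for x, y in pairs) / len(pairs)

# Made-up toy data (an assumption for illustration, not from the paper):
# x is a hypothetical CoT feature, y is the output attribute (1 = test-hacking).
samples = (
    [("mentions_tests", 1)] * 40
    + [("mentions_tests", 0)] * 10
    + [("plain", 1)] * 10
    + [("plain", 0)] * 40
)

def informed_monitor(x):
    # Hypothetical monitor that reads the CoT feature.
    return 1 if x == "mentions_tests" else 0

def blind_monitor(x):
    # Hypothetical monitor that ignores the CoT entirely.
    return 0

mi = mutual_information(samples)
acc_informed = accuracy(informed_monitor, samples)
acc_blind = accuracy(blind_monitor, samples)

print(f"I(CoT feature; output) = {mi:.4f} bits")
print(f"informed monitor accuracy = {acc_informed:.2f}")
print(f"blind monitor accuracy    = {acc_blind:.2f}")
```

In this toy setting the mutual information is positive, so monitoring is possible in principle, but only the monitor that extracts the feature benefits from it. The gap between the two monitors is a rough stand-in for the approximation errors the abstract describes; the paper's formal definitions of information gap and elicitation error are not reproduced here.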

Citations: 3 (0 influential); Altmetric: 5; Score: 28.0
