2603.05488v1 Mar 05, 2026 cs.CL

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

S. Boppana

Citations: 74

h-index: 5

An-gelos Ma

Citations: 21

h-index: 2

Max Loeffler

Citations: 107

h-index: 4

Raphael Sarfati

Citations: 91

h-index: 5

Eric J. Bigelow

Citations: 373

h-index: 7

Atticus Geiger

Citations: 383

h-index: 8

Owen Lewis

Citations: 38

h-index: 4

Jack Merullo

Citations: 727

h-index: 10

This work presents evidence of "performative chain-of-thought (CoT)" in reasoning models: the model becomes strongly confident in its final answer yet keeps generating tokens without revealing that internal belief. We compare activation probing, early forced answering, and a CoT monitor on two large models (DeepSeek-R1 671B and GPT-OSS 120B) and find differences that depend on task difficulty. On easy, recall-based MMLU questions, the model's final answer can be decoded from activations far earlier in the CoT than a monitor can detect it. On difficult, multihop GPQA-Diamond questions, by contrast, we observe genuine reasoning. Even so, answer revisions and "aha" moments occur almost exclusively in responses where probes show large belief shifts, which suggests that these behaviors reflect genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit cuts token counts by up to 80% on MMLU and up to 30% on GPQA-Diamond while maintaining similar accuracy, indicating that attention probing can serve as an efficient tool for detecting performative reasoning and can enable adaptive computation.

Original Abstract

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: the model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
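To make the idea of probe-guided early exit more concrete, here is a minimal sketch in Python. The abstract does not state the exit criterion the authors use, so everything here is an assumption for illustration: the function names `early_exit_step` and `token_savings`, the confidence `threshold`, and the `patience` rule (the same top answer must stay above the threshold for several consecutive CoT steps) are all hypothetical. The input is assumed to be a list of per-step probability distributions over answer choices, as a probe might produce.

```python
from typing import Optional, Sequence


def early_exit_step(
    probe_probs: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed confidence cutoff, not from the paper
    patience: int = 2,        # assumed number of consecutive confident steps
) -> Optional[int]:
    """Return the first step index at which the probe's top answer has stayed
    at or above `threshold` for `patience` consecutive steps, or None."""
    streak = 0
    prev_top = None
    for i, probs in enumerate(probe_probs):
        # Index of the most probable answer choice at this step.
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold:
            # Extend the streak only if the confident answer is unchanged.
            streak = streak + 1 if top == prev_top else 1
            prev_top = top
            if streak >= patience:
                return i
        else:
            streak = 0
            prev_top = None
    return None


def token_savings(exit_step: Optional[int], step_token_counts: Sequence[int]) -> float:
    """Fraction of CoT tokens skipped when generation stops after `exit_step`."""
    total = sum(step_token_counts)
    if exit_step is None or total == 0:
        return 0.0
    used = sum(step_token_counts[: exit_step + 1])
    return 1.0 - used / total


if __name__ == "__main__":
    # Toy probe outputs over four answer choices at four CoT steps.
    probs = [
        [0.40, 0.30, 0.20, 0.10],
        [0.01, 0.96, 0.02, 0.01],
        [0.01, 0.97, 0.01, 0.01],
        [0.00, 0.99, 0.005, 0.005],
    ]
    step = early_exit_step(probs)
    print(step, token_savings(step, [50, 50, 50, 50]))
```

In this toy setting a higher `threshold` or `patience` trades token savings for a lower risk of exiting before the model's belief has settled, which is the kind of accuracy-versus-compute trade-off the abstract alludes to.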

20 Citations

2 Influential

5 Altmetric

49.0 Score
