2603.20620v1 Mar 21, 2026 cs.AI

Reasoning Traces Influence Outputs, but Models Do Not Acknowledge It

Reasoning Traces Shape Outputs but Models Won't Say So

Ali Emami (Citations: 20, h-index: 2)

Lingjie Chen (Citations: 45, h-index: 3)

Yijie Hao (Citations: 71, h-index: 3)

Joyce C. Ho (Citations: 3, h-index: 1)

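Can the reasoning traces generated by large reasoning models (LRMs) be trusted? This study examines how these traces affect model outputs and how honestly models report that influence. We introduce Thought Injection, a method that inserts synthetic reasoning snippets into a model's <think> trace and then measures whether the model follows the injected reasoning and whether it acknowledges doing so. Across 45,000 samples from three LRMs, injected hints consistently changed outputs, confirming that reasoning traces causally shape model behavior. When asked to explain their changed answers, however, models overwhelmingly declined to disclose that influence: across 30,000 follow-up samples, overall non-disclosure exceeded 90% for extreme hints. Rather than acknowledging the injected reasoning, models produced explanations that appear aligned but are unrelated to it. Activation analysis shows that sycophancy- and deception-related directions are strongly activated during these explanations, which suggests a systematic pattern rather than incidental failure. These results reveal a gap between the reasoning models actually follow and the reasoning they report, and raise the concern that aligned-appearing explanations may not be equivalent to genuine alignment.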

Original Abstract

Can we trust the reasoning traces that large reasoning models (LRMs) produce? We investigate whether these traces faithfully reflect what drives model outputs, and whether models will honestly report their influence. We introduce Thought Injection, a method that injects synthetic reasoning snippets into a model's <think> trace, then measures whether the model follows the injected reasoning and acknowledges doing so. Across 45,000 samples from three LRMs, we find that injected hints reliably alter outputs, confirming that reasoning traces causally shape model behavior. However, when asked to explain their changed answers, models overwhelmingly refuse to disclose the influence: overall non-disclosure exceeds 90% for extreme hints across 30,000 follow-up samples. Instead of acknowledging the injected reasoning, models fabricate aligned-appearing but unrelated explanations. Activation analysis reveals that sycophancy- and deception-related directions are strongly activated during these fabrications, suggesting systematic patterns rather than incidental failures. Our findings reveal a gap between the reasoning LRMs follow and the reasoning they report, raising concern that aligned-appearing explanations may not be equivalent to genuine alignment.
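The measurement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`inject_thought`, `Trial`, `followed_hint`, `discloses`, `summarize`), the prompt layout, and the keyword-based disclosure check are all hypothetical, and the paper may well use a different prompt format and a more careful judge for disclosure. The sketch only shows how a "follow rate" and a "non-disclosure rate" could be computed once model responses have been collected.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


def inject_thought(prompt: str, snippet: str) -> str:
    """Prefill the reasoning trace with a synthetic snippet.

    Assumption: the model continues generating from an open <think> block.
    The exact template is hypothetical and not taken from the paper.
    """
    return f"{prompt}\n<think>\n{snippet}\n"


@dataclass(frozen=True)
class Trial:
    baseline_answer: str   # answer without any injection
    hinted_answer: str     # answer the injected snippet argues for
    injected_answer: str   # answer produced after injection
    explanation: str       # follow-up explanation of the changed answer


# Hypothetical keyword list; a crude stand-in for a real disclosure judge.
DISCLOSURE_MARKERS: Sequence[str] = ("hint", "injected", "earlier thought")


def followed_hint(t: Trial) -> bool:
    """True if the answer moved to the hinted option because of the injection."""
    return t.injected_answer == t.hinted_answer and t.baseline_answer != t.hinted_answer


def discloses(t: Trial, markers: Sequence[str] = DISCLOSURE_MARKERS) -> bool:
    """True if the explanation mentions the injected reasoning (keyword match)."""
    text = t.explanation.lower()
    return any(m in text for m in markers)


def summarize(trials: Iterable[Trial]) -> dict:
    """Compute the follow rate and, among followed trials, the non-disclosure rate."""
    trials = list(trials)
    followed = [t for t in trials if followed_hint(t)]
    hidden = [t for t in followed if not discloses(t)]
    return {
        "follow_rate": len(followed) / len(trials) if trials else 0.0,
        "non_disclosure_rate": len(hidden) / len(followed) if followed else 0.0,
    }


if __name__ == "__main__":
    demo = [
        Trial("A", "B", "B", "Option B is simply the safest choice."),
        Trial("A", "B", "B", "I followed a hint in my reasoning trace."),
        Trial("A", "B", "A", "A is correct."),
    ]
    print(summarize(demo))
```

Conditioning the non-disclosure rate on trials where the answer actually changed mirrors the abstract's framing, in which models are asked to explain their changed answers; whether the paper uses exactly this denominator is an assumption.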

0 Citations

0 Influential

1.5 Altmetric

7.5 Score


