2605.07776v1 May 08, 2026 cs.LG

Tracing Uncertainty in Language Model "Reasoning"

Philipp Mondorf (Citations: 436, h-index: 6)

Barbara Plank (Citations: 186, h-index: 6)

J. Frellsen (Citations: 1,972, h-index: 23)

Nils Grunefeld (Citations: 1, h-index: 1)

Christian Hardmeier (Citations: 22, h-index: 2)

Bertram Højer (Citations: 28, h-index: 2)

Anna Rogers (Citations: 36, h-index: 2)

Stefan Heinrich (Citations: 6, h-index: 1)

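Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the mechanisms underlying the process are still poorly understood. This study treats "reasoning" traces, the intermediate token sequences that a language model generates, as changes in the model's state and analyzes them from the perspective of uncertainty quantification. Each trace is summarized by an uncertainty profile: a handful of features (for example, slope and linearity) that describe the shape of the uncertainty signal across the trace. Across five language models evaluated on GSM8K and ProntoQA, these profiles were used to predict whether a trace reaches a correct final answer, achieving an AUROC of up to 0.807, a marked improvement over recent related work. Using only the first few hundred tokens of a full trace, the method still reaches an AUROC of 0.801, which suggests that errors can be detected early in generation. A detailed comparison of correct and incorrect traces shows qualitatively distinct uncertainty profiles, with correct traces tending toward a steeper, less linear decline in uncertainty. Taken together, the results suggest that the method, grounded in decision-making under uncertainty, offers a useful analytical tool for studying the generative process behind LM "reasoning".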

Original Abstract

Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treating the "reasoning" traces, the intermediate token sequences generated by LMs, as evolving model states. We summarize each trace by an uncertainty trace profile: a small set of features describing the shape of the uncertainty signal over its trace, such as its slope and linearity. We find that across five LMs evaluated on GSM8K and ProntoQA, these profiles predict whether a trace yields a correct final answer with AUROC up to 0.807, improving markedly on recent related work. We reach AUROC 0.801 using only the first few hundred tokens of full traces, suggesting that errors can be detected early in the generation. A detailed comparison of correct and incorrect traces further reveals qualitatively distinct uncertainty profiles, with correct traces showing a steeper and less linear decline in uncertainty. Together, the results suggest that our method, grounded in decision-making under uncertainty, provides a principled lens for studying the generative process underlying LM "reasoning".
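To make the idea of an uncertainty trace profile concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract names only slope and linearity as example features, so the use of per-token Shannon entropy as the uncertainty signal, R² of a least-squares line as the linearity measure, and the function names `token_entropy`, `trace_profile`, and `auroc` are all assumptions made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution.

    Assumption: entropy stands in for the paper's uncertainty signal.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trace_profile(signal, prefix=None):
    """Summarize an uncertainty signal by the slope and R^2 of a fitted line.

    `prefix` optionally keeps only the first `prefix` steps, mimicking
    early detection from the start of a trace.
    """
    if prefix is not None:
        signal = signal[:prefix]
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two uncertainty values")
    mean_x = (n - 1) / 2.0
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(signal))
    syy = sum((y - mean_y) ** 2 for y in signal)
    slope = sxy / sxx
    # R^2 of the linear fit; a constant signal is treated as perfectly linear.
    linearity = 1.0 if syy == 0.0 else (sxy * sxy) / (sxx * syy)
    return {"slope": slope, "linearity": linearity}


def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy, hand-made uncertainty signals (not data from the paper).
traces = [
    [3.0, 1.2, 0.6, 0.4, 0.3],  # steep, curved decline
    [3.0, 2.8, 2.7, 2.5, 2.4],  # shallow, nearly linear decline
]
labels = [1, 0]  # 1 = trace ended in a correct answer (made up)
profiles = [trace_profile(t) for t in traces]
# Use steepness of the decline (negated slope) as a one-feature score.
scores = [-p["slope"] for p in profiles]
toy_auroc = auroc(scores, labels)
```

In practice the profile features would feed a trained classifier rather than a single hand-picked score, and the signal would come from the model's actual next-token distributions; the sketch only shows how a trace can be reduced to shape features and how those features are scored with AUROC.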

0 Citations

0 Influential

11.5 Altmetric

57.5 Score
