2603.07078v1 Mar 07, 2026 cs.AI

CoTJudger: Automatic Evaluation of Chain-of-Thought Reasoning Efficiency and Redundancy in LRMs with a Graph-Based Framework

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Ge Zhang

Citations: 135

h-index: 6

Min Yang

Citations: 128

h-index: 6

H. Alinejad-Rokny

Citations: 211

h-index: 9

Shiwen Ni

Citations: 729

h-index: 12

Jiaheng Liu

Citations: 963

h-index: 17

Wenhao Huang

Citations: 405

h-index: 8

Siying Li

Citations: 4

h-index: 1

Jiajun Shi

Citations: 277

h-index: 6

Shuaimin Li

Citations: 52

h-index: 2

Shijian Wang

Citations: 62

h-index: 2

Zhoufutu Wen

Citations: 483

h-index: 8

Yizhi Li

University of Manchester

Citations: 1,877

h-index: 25

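Large reasoning models (LRMs) achieve strong performance by working through an extended chain-of-thought (CoT) reasoning process before producing an answer. However, this approach often induces over-reasoning, which includes unnecessary calculations and circular self-verification that raise computational cost without improving results. Existing evaluations focus mainly on final accuracy or coarse token counts, and lack automated tools for separating essential logic from structural redundancy. This paper introduces CoTJudger, a graph-based framework that quantifies reasoning efficiency by converting free-form CoT traces into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency metric, comparable across models and tasks, that indicates how much of a CoT trace is essential and how much is structurally redundant. In an evaluation of 21 LRMs, CoTJudger reveals pervasive redundancy and identifies recurring failure patterns such as verification obsession and compensatory redundancy. These results provide a practical metric for separating reasoning ability from computational waste and support more accurate evaluation and diagnosis of LRM efficiency.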

Original Abstract

Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal -- how much of a CoT is necessary versus structurally redundant -- that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
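The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: represent reasoning steps as nodes in a directed dependency graph, find a shortest path from the problem statement to the final answer, and treat steps off that path as redundant. The breadth-first search, the function names, the toy step labels, and the redundancy formula (one minus path length over total step count) are all assumptions made for illustration. They are not CoTJudger's actual SEP extraction or metric.

```python
from collections import deque


def shortest_effective_path(edges, start, goal):
    """Return a shortest directed path from start to goal, or None if unreachable.

    Plain breadth-first search over an unweighted directed graph.
    This is an illustrative stand-in, not the paper's SEP procedure.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def redundancy_ratio(edges, start, goal):
    """Fraction of steps that lie off the shortest path (hypothetical metric)."""
    nodes = {start, goal}
    for src, dst in edges:
        nodes.update((src, dst))
    path = shortest_effective_path(edges, start, goal)
    if path is None:
        return None  # the trace never reaches the answer
    return 1 - len(path) / len(nodes)


# Toy trace (invented labels): a self-verification loop and a dead-end detour.
EXAMPLE_EDGES = [
    ("problem", "set_up"),
    ("set_up", "compute"),
    ("compute", "answer"),
    ("compute", "recheck"),  # circular self-verification
    ("recheck", "compute"),
    ("set_up", "detour"),    # step that never contributes to the answer
]

if __name__ == "__main__":
    print(shortest_effective_path(EXAMPLE_EDGES, "problem", "answer"))
    print(redundancy_ratio(EXAMPLE_EDGES, "problem", "answer"))
```

In this toy trace, four of the six steps lie on the shortest path, so the sketch reports one third of the steps as redundant. A real pipeline would first need to segment a free-form CoT into steps and infer the dependency edges, which this sketch takes as given.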

1 Citation

0 Influential

12.5 Altmetric

63.5 Score


