2602.14444v1 Feb 16, 2026 cs.LG

Broken Chains: The Cost of Incomplete Reasoning in LLMs

Maheep Chaudhary

Citations: 18

h-index: 2

Ian Su

Citations: 2

h-index: 1

Gaurav Purushothaman

Citations: 0

h-index: 0

Jeyani Narayan

Citations: 0

h-index: 0

R. Goel

Citations: 48

h-index: 3

Kevin Zhu

Citations: 13

h-index: 2

Sunishchal Dev

Citations: 32

h-index: 4

Yash More

Citations: 88

h-index: 4

Abstract

Reasoning-specialized models like OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10%, 30%, 50%, and 70% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) **truncated reasoning can hurt**, as DeepSeek-V3.2 achieves 53% accuracy with no reasoning but only 17% with CoT truncated at 50% budget; (2) **code degrades gracefully**, as Gemini's comment-based reasoning collapses to 0% while its code-based reasoning maintains 43-47%; (3) **hybrid reasoning underperforms** single modalities; (4) **robustness is model-dependent**, as Grok maintains 80-90% at 30% budget whereas OpenAI and DeepSeek collapse to 7-27%. These results suggest that incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.
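The budget ablation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define "optimal", so the code assumes it is the token count of an unconstrained reasoning trace, and it uses a plain list of strings as a stand-in for real tokenizer output. The names `BUDGET_PERCENTS`, `token_budget`, and `truncate_trace` are hypothetical.

```python
from typing import List

# Budget levels named in the abstract, as integer percentages of "optimal".
BUDGET_PERCENTS = (10, 30, 50, 70)


def token_budget(optimal_tokens: int, percent: int) -> int:
    """Return the reasoning-token allowance at a given percentage of optimal.

    Assumption: "optimal" is the length of the unconstrained trace.
    """
    if optimal_tokens < 0:
        raise ValueError("optimal_tokens must be non-negative")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point rounding surprises.
    return optimal_tokens * percent // 100


def truncate_trace(tokens: List[str], optimal_tokens: int, percent: int) -> List[str]:
    """Cut a reasoning trace off once the token allowance is exhausted."""
    return tokens[: token_budget(optimal_tokens, percent)]


if __name__ == "__main__":
    # Toy trace: ten placeholder tokens standing in for a real CoT.
    trace = [f"t{i}" for i in range(10)]
    for pct in BUDGET_PERCENTS:
        kept = truncate_trace(trace, optimal_tokens=len(trace), percent=pct)
        print(pct, len(kept))
```

Under these assumptions, a 1,000-token unconstrained trace would be cut to 100, 300, 500, and 700 tokens at the four budget levels. The truncated trace simply stops mid-argument, which is the "broken chain" condition the abstract reports as harmful.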

0 Citations

0 Influential

2 Altmetric

10.0 Score
