2602.01198v1 Feb 01, 2026 cs.AI

A State-Transition Framework for Efficient LLM Reasoning

Yu Zhao (Citations: 196, h-index: 4)
Longyue Wang (Citations: 149, h-index: 6)
Tianqi Shi (Citations: 172, h-index: 3)
Weihua Luo (Citations: 738, h-index: 13)
Kaifu Zhang (Citations: 685, h-index: 12)
Jinsong Su (Citations: 69, h-index: 5)
Liang Zhang (Citations: 128, h-index: 5)

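Long Chain-of-Thought (CoT) reasoning substantially improves the performance of large language models (LLMs) on complex reasoning tasks, but the heavy computational and memory cost of generating long CoT sequences limits its efficiency and practicality. Prior work has mainly tried to improve LLM reasoning efficiency by compressing CoT sequences, but that approach conflicts with test-time scaling and so limits the model's reasoning capacity. This paper proposes an efficient reasoning framework that models the LLM's reasoning process as a state-transition process. Specifically, a linear attention mechanism is first applied to estimate the LLM's reasoning state, which records historical information from previous reasoning steps. Then, based on the query prompt and the reasoning state, the LLM efficiently performs the current reasoning step and updates the state. With linear attention, each token in the current reasoning step can retrieve relevant historical information directly from the reasoning state without explicitly attending to tokens from previous steps. This reduces the computational complexity of attention from quadratic to linear, significantly improving reasoning efficiency. In addition, a state-based reasoning strategy is proposed to mitigate the over-thinking problem caused by noisy reasoning steps. Extensive experiments across multiple datasets and model sizes show that the proposed framework improves both the reasoning efficiency and the reasoning performance of LLMs.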

Original Abstract

While Long Chain-of-Thought (CoT) reasoning significantly improves Large Language Models (LLMs) performance on complex reasoning tasks, the substantial computational and memory costs of generating long CoT sequences limit their efficiency and practicality. Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences. However, this approach conflicts with test-time scaling, limiting the reasoning capacity of LLMs. In this paper, we propose an efficient reasoning framework that models the reasoning process of LLMs as a state-transition process. Specifically, we first apply a linear attention mechanism to estimate the LLM's reasoning state, which records the historical reasoning information from previous reasoning steps. Then, based on the query prompt and the reasoning state, the LLM can efficiently perform the current reasoning step and update the state. With the linear attention, each token in the current reasoning step can directly retrieve relevant historical reasoning information from the reasoning state, without explicitly attending to tokens in previous reasoning steps. In this way, the computational complexity of attention is reduced from quadratic to linear, significantly improving the reasoning efficiency of LLMs. In addition, we propose a state-based reasoning strategy to mitigate the over-thinking issue caused by noisy reasoning steps. Extensive experiments across multiple datasets and model sizes demonstrate that our framework not only improves the reasoning efficiency of LLMs but also enhances their reasoning performance.
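The abstract's central idea, reading history from a fixed-size state instead of attending to every earlier token, can be illustrated with a generic linear-attention recurrence. The sketch below is not the paper's implementation: the `elu(x) + 1` feature map, the class name `ReasoningState`, and the helper `explicit_read` are assumptions chosen for illustration, and the abstract does not specify how the state is estimated or updated. It only shows that a running summary `S = sum(phi(k) v^T)` with normalizer `z = sum(phi(k))` returns the same result as attending explicitly over all past tokens, at a per-read cost that does not grow with the history length.

```python
import math


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention.
    # This choice is an assumption; the abstract does not name one.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class ReasoningState:
    """Hypothetical fixed-size summary of all past (key, value) pairs."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # sum of phi(k) v^T
        self.z = [0.0] * d_k                        # sum of phi(k)

    def update(self, k, v):
        # Fold one past token into the state; cost is O(d_k * d_v).
        fk = feature_map(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            row = self.S[i]
            for j, vj in enumerate(v):
                row[j] += fki * vj

    def read(self, q):
        # Retrieve history for query q without touching past tokens.
        # Requires at least one prior update so the normalizer is positive.
        fq = feature_map(q)
        denom = sum(a * b for a, b in zip(fq, self.z))
        d_v = len(self.S[0])
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(d_v)
        ]


def explicit_read(q, keys, values):
    # Reference computation: attend to every past token, O(n) per query.
    fq = feature_map(q)
    weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in keys]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d_v)]


if __name__ == "__main__":
    keys = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
    values = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 3.0, 0.5]]
    state = ReasoningState(d_k=2, d_v=3)
    for k, v in zip(keys, values):
        state.update(k, v)
    query = [0.7, -0.4]
    print(state.read(query))
    print(explicit_read(query, keys, values))
```

Because each read costs O(d_k * d_v) regardless of how many tokens have been folded in, generating n tokens costs O(n) overall rather than the O(n^2) of softmax attention over a growing context, which is the quadratic-to-linear reduction the abstract describes.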

Citations: 1 | Influential: 0 | Altmetric: 6.5 | Score: 33.5
