2602.06960v2 Feb 06, 2026 cs.CL

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Yuchen Yan

Zhejiang University

Citations: 490

h-index: 13

Liang Jiang

Citations: 15

h-index: 2

Jin Jiang

Citations: 41

h-index: 4

Shuaicheng Li

Citations: 41

h-index: 3

Zujie Wen

Citations: 210

h-index: 7

Zhiqiang Zhang

Citations: 24

h-index: 3

Jun Zhou

Citations: 78

h-index: 5

Jian Shao

Citations: 117

h-index: 6

Yueting Zhuang

Citations: 610

h-index: 14

Yongliang Shen

Citations: 397

h-index: 10

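Large reasoning models reach strong performance by extending their chain-of-thought at inference time, but this approach brings quadratic cost growth, context length limits, and weaker reasoning caused by lost-in-the-middle effects. Iterative reasoning eases these problems by periodically summarizing intermediate thoughts, yet existing methods depend on supervised learning or fixed heuristics and do not optimize when to summarize, what to preserve, or how to resume reasoning. This work proposes InftyThink+, an end-to-end reinforcement learning framework that optimizes the whole iterative reasoning process using model-controlled iteration boundaries and explicit summaries. InftyThink+ uses a two-stage training scheme, a supervised cold-start stage followed by trajectory-level reinforcement learning, so that the model learns strategic summarization and continuation decisions. In experiments on DeepSeek-R1-Distill-Qwen-1.5B, InftyThink+ improves accuracy on AIME24 by 21%, clearly outperforms conventional long chain-of-thought reinforcement learning, and generalizes better to out-of-distribution benchmarks. InftyThink+ also substantially reduces inference latency and speeds up reinforcement learning training, showing better reasoning efficiency together with stronger performance.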

Original Abstract

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
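The quadratic-cost argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. The code below is not from the paper: it assumes causal attention where each generated token attends to every earlier position, counts those token-pair interactions as a stand-in for compute, and uses hypothetical function names and made-up lengths. It also assumes that each later iteration sees only the prompt plus a fixed-length summary, and that the per-iteration token budget already includes the tokens spent writing the summary.

```python
def attention_pairs(context_len: int, new_tokens: int) -> int:
    """Count token-pair interactions for generating `new_tokens` tokens
    after `context_len` tokens of context under causal attention.

    The i-th new token (1-indexed) attends to context_len + i positions,
    so the total is new_tokens * context_len + new_tokens * (new_tokens + 1) / 2.
    """
    return new_tokens * context_len + new_tokens * (new_tokens + 1) // 2


def single_chain_cost(prompt_len: int, reasoning_tokens: int) -> int:
    """Cost of one long chain-of-thought generated in a single pass."""
    return attention_pairs(prompt_len, reasoning_tokens)


def iterative_cost(prompt_len: int, reasoning_tokens: int,
                   iterations: int, summary_len: int) -> int:
    """Cost when the same token budget is split into equal iterations.

    Assumption (illustrative only): every iteration after the first starts
    from the prompt plus a summary of `summary_len` tokens, and earlier
    reasoning is discarded from the context.
    """
    per_iteration = reasoning_tokens // iterations
    total = 0
    for i in range(iterations):
        context = prompt_len + (summary_len if i > 0 else 0)
        total += attention_pairs(context, per_iteration)
    return total


if __name__ == "__main__":
    # Made-up sizes, chosen only to show the trend.
    long_cot = single_chain_cost(prompt_len=500, reasoning_tokens=32_000)
    iterated = iterative_cost(prompt_len=500, reasoning_tokens=32_000,
                              iterations=4, summary_len=500)
    print(f"single chain: {long_cot:,} pairs")
    print(f"4 iterations: {iterated:,} pairs")
```

Under these assumptions the single chain grows with the square of the reasoning length, while the iterative variant grows with the square of the per-iteration length times the number of iterations. This is only a cost model; it says nothing about accuracy, which is what the reinforcement learning stage in the paper targets.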

4 Citations

0 Influential

7 Altmetric

39.0 Score
