2604.16890v1 Apr 18, 2026 cs.AI

Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

Weida Wang

Shanghai AI Laboratory

Min Zhang

Mingbao Lin

Benteng Chen

Shufei Zhang

Abstract

Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.
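To make the step-level idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical marker list (`STEP_MARKERS`), a hypothetical baseline (the mean step count of the correct rollouts in a group), and a hypothetical linear penalty with coefficient `penalty`; the abstract specifies none of these. The final normalization is the standard GRPO group-relative advantage. The Dynamic Truncated Rollout mechanism is left out because the abstract gives no detail on how truncation points are chosen.

```python
import re
from statistics import mean, pstdev

# Hypothetical markers; the abstract does not list the ones actually used.
STEP_MARKERS = ("Wait", "Alternatively", "Hmm")
_MARKER_RE = re.compile(r"(?=\b(?:%s)\b)" % "|".join(STEP_MARKERS))


def split_steps(trace: str) -> list[str]:
    """Split a reasoning trace into steps, starting a new step at each marker."""
    # Zero-width split (Python 3.7+): the marker stays with the step it opens.
    return [part.strip() for part in _MARKER_RE.split(trace) if part.strip()]


def step_aware_rewards(step_counts, correct, penalty=0.1):
    """Score each rollout in a group (assumed reward shape, for illustration).

    Correct rollouts get 1.0 minus a penalty proportional to how far their
    step count exceeds the group baseline; incorrect rollouts get 0.0.
    """
    correct_counts = [n for n, ok in zip(step_counts, correct) if ok]
    if not correct_counts:
        return [0.0] * len(step_counts)
    baseline = mean(correct_counts)  # assumed group-level baseline
    rewards = []
    for n, ok in zip(step_counts, correct):
        if not ok:
            rewards.append(0.0)
            continue
        excess = max(0.0, (n - baseline) / baseline)
        rewards.append(1.0 - penalty * excess)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    trace = "Compute 2+2=4. Wait, check again: 4. Alternatively, count up."
    print(split_steps(trace))
    rewards = step_aware_rewards([2, 4, 6], [True, True, False])
    print(rewards, group_advantages(rewards))
```

Under these assumptions, a correct rollout is never penalized for being shorter than the group baseline, so the pressure is only against steps beyond what the group's correct answers typically need.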
