2605.29860v1 May 28, 2026 cs.LG

ESPO: Early-Stopping Proximal Policy Optimization

Wenhan Yu
Zhewen Tan
Binhua Li
Yongbin Li
Yingcheng Shi
Tong Yang
Zihang Li
Ruikang Zhou
Technical University of Munich
Jieping Ye
Zixiang Liu
Zeming Li

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% of rollout tokens cumulatively.
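
To make the mechanism in the abstract concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define the surrogate regret, the smoothing rule, or the stopping threshold, so everything here is an assumption for illustration: the regret is taken to be the log-probability gap between the most likely token and the sampled token, the smoothing is an exponentially discounted running sum with a hypothetical factor `beta`, and `threshold` is a fixed hypothetical constant standing in for the paper's estimated values. The helper `td_errors` illustrates how a terminal failure reward at the truncation step yields the largest negative TD error there, assuming zero intermediate rewards and a zero-valued absorbing state.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled_id):
    # ASSUMPTION: regret proxy = log-prob of the most likely token minus
    # log-prob of the token actually sampled (0 when sampling is greedy).
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_id]


def early_stop_step(step_logits, sampled_ids, threshold=2.5, beta=0.9):
    """Return the step index at which the rollout would be cut, or None.

    ASSUMPTION: 'smoothed cumulative regret' is modeled as an exponentially
    discounted running sum; `threshold` and `beta` are hypothetical values.
    """
    smoothed = 0.0
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_ids)):
        smoothed = beta * smoothed + surrogate_regret(logits, tok)
        if smoothed > threshold:
            return t
    return None


def td_errors(values, terminal_reward, gamma=1.0):
    """One-step TD errors for a truncated trajectory.

    values[t] is a critic estimate V(s_t). ASSUMPTION: rewards are zero except
    for `terminal_reward` at the last step, and the absorbing failure state
    that follows has value 0.
    """
    deltas = []
    for t in range(len(values)):
        is_last = t == len(values) - 1
        reward = terminal_reward if is_last else 0.0
        next_value = 0.0 if is_last else values[t + 1]
        deltas.append(reward + gamma * next_value - values[t])
    return deltas


if __name__ == "__main__":
    logits = [[2.0, 0.0]] * 3          # same toy two-token distribution each step
    stop = early_stop_step(logits, sampled_ids=[0, 1, 1])
    print("stop at step:", stop)
    print("TD errors:", td_errors([0.5, 0.4, 0.3], terminal_reward=-1.0))
```

In this toy run the first token is the greedy choice (zero regret) and the next two are low-probability picks, so the running sum crosses the threshold on the third step. Because only logits already produced during sampling are used, a check of this shape adds no extra forward passes, which matches the abstract's description.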
