2605.29860v1 May 28, 2026 cs.LG

ESPO: Early-Stopping Proximal Policy Optimization

Wenhan Yu


Zhewen Tan


Binhua Li


Yongbin Li


Yingcheng Shi


Tong Yang


Zihang Li


Ruikang Zhou

Technical University of Munich


Jieping Ye


Zixiang Liu


Zeming Li


When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon. This spends compute on tokens that never receive positive reward and pollutes advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates the rollout when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, which concentrates negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while cumulatively saving more than 20% of rollout tokens.
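The early-stopping rule described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything specific here is an assumption made for illustration: the surrogate regret is taken to be the log-probability gap between the greedy token and the sampled token, the smoothing is an exponential moving average, and "significantly exceeds its estimated values" is modeled as exceeding a per-step baseline by a fixed margin. The names `surrogate_regret`, `find_stop_step`, `alpha`, `margin`, and `baseline` are hypothetical and do not come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled_token):
    """ASSUMED regret definition (not from the paper): log-prob gap between
    the most likely token and the token that was actually sampled.
    Uses only the logits already available from sampling."""
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_token]


def find_stop_step(regrets, baseline, alpha=0.1, margin=2.0):
    """Return the first step where the accumulated, EMA-smoothed regret
    exceeds baseline[t] + margin, or None if the rollout is never stopped.

    `baseline`, `alpha`, and `margin` are hypothetical stand-ins for the
    paper's "estimated values" and significance test."""
    smoothed = 0.0
    cumulative = 0.0
    for t, r in enumerate(regrets):
        smoothed = (1.0 - alpha) * smoothed + alpha * r  # EMA smoothing
        cumulative += smoothed                           # running total
        if cumulative > baseline[t] + margin:
            return t
    return None


# Usage: a rollout whose per-step regret jumps at step 2.
regrets = [0.0, 0.0, 5.0, 5.0, 5.0]
stop = find_stop_step(regrets, baseline=[0.0] * 5, alpha=0.5, margin=2.0)
```

In a training loop, a trajectory truncated at `stop` would then be assigned the terminal failure reward that the abstract describes, rather than being generated to the maximum horizon.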

