2605.25507v1 May 25, 2026 cs.AI

Credit Assignment with Resets in Language Model Reasoning

Ankur Samanta
Akshayaa Magesh
Kavosh Asadi
Kaveh Hassani
Paul Sajda
Jalaj Bhandari
Ayush Jain
Youliang Yu
Daniel Jiang
Yonathan Efroni

Contemporary methods for reinforcement learning with verifiable rewards post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Better credit assignment addresses this limitation by enabling targeted refinement of faulty reasoning steps rather than uniform updates to entire trajectories. Resets are one simple mechanism for this: returning to an intermediate state and resampling counterfactual continuations lets outcome differences be attributed to decisions made at that point. We propose two reset-based methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework: extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO. It samples multiple suffix continuations at a self-localized reset and learns from their rewards, using only the model itself with no external supervision.
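The reset mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy task, the function names (`group_relative_advantages`, `reset_and_resample`, `toy_sample_suffix`, `toy_reward`), and the use of GRPO-style mean/std normalization over the resampled suffixes are all assumptions made for illustration. The reset index is drawn at random, which loosely mirrors the RRPO setting; in SRPO the index would instead come from the model's own error localization, which is not modeled here.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Mean-center and scale rewards within one group (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reset_and_resample(steps, reset_index, sample_suffix, reward_fn, k, rng):
    """Keep steps[:reset_index] fixed, sample k counterfactual suffixes, score each."""
    prefix = steps[:reset_index]
    suffixes = [sample_suffix(prefix, rng) for _ in range(k)]
    rewards = [reward_fn(prefix + s) for s in suffixes]
    advantages = group_relative_advantages(rewards)
    # All k continuations share the prefix, so reward differences are
    # attributable to the suffix steps; only those steps receive credit.
    return list(zip(suffixes, rewards, advantages))


# Hypothetical toy task: a trajectory is 4 digits; reward is 1.0 if they sum to 10.
def toy_sample_suffix(prefix, rng, length=4):
    return [rng.randint(0, 5) for _ in range(length - len(prefix))]


def toy_reward(steps):
    return 1.0 if sum(steps) == 10 else 0.0


rng = random.Random(0)
failed = [5, 0, 1, 1]                      # an incorrect trajectory (sums to 7)
reset_index = rng.randrange(len(failed))   # random reset state (assumption: uniform)
for suffix, reward, adv in reset_and_resample(
    failed, reset_index, toy_sample_suffix, toy_reward, k=8, rng=rng
):
    print(suffix, reward, round(adv, 3))
```

If every sampled suffix receives the same reward, all advantages are zero and the reset provides no learning signal. This is one intuition for why targeting states where the outcome can still change may matter more than resetting at random.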
