2602.24110v1 Feb 27, 2026 cs.AI

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Authors:
Baosheng Yu (Citations: 2, h-index: 1)
Jiaxing Huang (Citations: 542, h-index: 9)
Yanwei Ren (Citations: 2, h-index: 1)
Haotian Zhang (Citations: 2, h-index: 1)
Likang Xiao (Citations: 1, h-index: 1)
Xikai Zhang (Citations: 1, h-index: 1)
Jiayan Qiu (Citations: 2, h-index: 1)
Quan Chen (Citations: 7, h-index: 1)
Liu Liu (Citations: 2, h-index: 1)

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision has a critical limitation: it penalizes trajectories that are largely correct but fail because of a few missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable, largely correct rollouts, which degrades rollout diversity and prematurely narrows the exploration space. Although Process Reward Models have proven effective at providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Prior methods introduce off-policy guided whole-trajectory replacement, which often falls outside the policy model's distribution; they still fail to use the largely correct rollouts generated by the model itself and therefore do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that uses Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement to partially correct rollouts, our method effectively salvages partially correct trajectories and increases the diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
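The core idea in the abstract (locate the first erroneous step with a Process Reward Model, keep the on-policy prefix before it, and rectify from that point onward) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how step scores are thresholded or how the correction is generated, so the names `first_error_index`, `rectify_rollout`, the `corrector` callback, and the `threshold` value of 0.5 are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence


def first_error_index(step_scores: Sequence[float],
                      threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`.

    `step_scores` stands in for per-step PRM outputs; the threshold is an
    assumed hyperparameter, not a value taken from the paper.
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None  # no step flagged as erroneous


def rectify_rollout(steps: Sequence[str],
                    step_scores: Sequence[float],
                    corrector: Callable[[List[str]], List[str]],
                    threshold: float = 0.5) -> List[str]:
    """Keep the on-policy prefix before the first flagged step and let a
    hypothetical off-policy `corrector` continue from that prefix."""
    idx = first_error_index(step_scores, threshold)
    if idx is None:
        return list(steps)  # nothing to fix; keep the rollout unchanged
    prefix = list(steps[:idx])
    return prefix + list(corrector(prefix))


# Toy usage: a rollout that goes wrong at the third step.
steps = ["x + 2 = 5", "x = 5 - 2", "x = 4", "answer: 4"]
scores = [0.95, 0.90, 0.10, 0.20]  # made-up PRM scores for illustration


def toy_corrector(prefix: List[str]) -> List[str]:
    # Stand-in for an off-policy guide (e.g. a stronger model).
    return ["x = 3", "answer: 3"]


fixed = rectify_rollout(steps, scores, toy_corrector)
print(fixed)
```

In this toy case the first two model-generated steps are retained and only the suffix is replaced, whereas whole-trajectory replacement would discard the correct prefix as well.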

Citations: 1, Influential: 0, Altmetric: 4.5, Score: 23.5
