2601.10201v1 Jan 15, 2026 cs.LG

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

Ruida Wang

Tong Zhang

Jiarui Yao

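Improving the reasoning ability of large language models (LLMs) is an active research topic. Most existing work, however, relies on trajectory-level outcome rewards and provides no fine-grained supervision during the reasoning process. Training frameworks that do incorporate process signals depend on cumbersome additional steps, such as MCTS or training a separate reward model, which reduces training efficiency. Moreover, the design of process signals lacks rigorous theoretical grounding, which obscures how the optimization actually works. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy-regularized reinforcement learning objective into intermediate steps and uses rigorously derived process rewards that can be assigned to the model accordingly. Starting from this theoretical motivation, we derive the PRL formulation, which is essentially equivalent to reward maximization plus a KL-divergence penalty between the policy model and a reference model. Unlike that trajectory-level objective, PRL converts the outcome reward into process supervision signals, which guide exploration more effectively during RL optimization. Experiments show that PRL improves average reasoning performance, measured by average@n, and also broadens the reasoning boundary by improving pass@n. Extensive experiments verify the effectiveness of PRL and show that it generalizes.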
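
For orientation, the "reward maximization plus a KL-divergence penalty" objective mentioned in the abstract is commonly written as below. This is only the standard trajectory-level form, not the paper's step-wise decomposition, and the symbols are assumptions that do not appear in the abstract: $\pi$ is the policy, $\pi_{\mathrm{ref}}$ the reference model, $r$ the outcome reward, $\mathcal{D}$ the prompt distribution, and $\beta > 0$ the penalty weight.

```latex
\max_{\pi} \;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```

Here $r(x, y)$ scores a complete response $y$. According to the abstract, PRL distributes this signal over intermediate reasoning steps.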

Original Abstract

Improving the reasoning abilities of Large Language Models (LLMs) has been a continuous topic recently. But most relevant works are based on outcome rewards at the trajectory level, missing fine-grained supervision during the reasoning process. Other existing training frameworks that try to combine process signals together to optimize LLMs also rely heavily on tedious additional steps like MCTS, training a separate reward model, etc., doing harm to the training efficiency. Moreover, the intuition behind the process signals design lacks rigorous theoretical support, leaving the understanding of the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that could be assigned to models accordingly. Starting from theoretical motivation, we derive the formulation of PRL that is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL could turn the outcome reward into process supervision signals, which helps better guide the exploration during RL optimization. From our experiment results, we demonstrate that PRL not only improves the average performance for LLMs' reasoning ability measured by average @ n, but also broadens the reasoning boundary by improving the pass @ n metric. Extensive experiments show the effectiveness of PRL could be verified and generalized.
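
The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not define them, so the sketch assumes the common definitions: average@n is the mean per-sample accuracy over n sampled responses per problem, and pass@n is the fraction of problems for which at least one of the n samples is correct. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def average_at_n(correct: Sequence[Sequence[bool]]) -> float:
    """Mean per-sample accuracy, averaged over problems (assumed definition)."""
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)


def pass_at_n(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct sample (assumed definition)."""
    return sum(1 for row in correct if any(row)) / len(correct)


# Toy data: 3 problems, n = 4 sampled responses each (True = correct answer).
results = [
    [True, False, True, True],
    [False, False, False, False],
    [False, True, False, False],
]

print(f"average@4 = {average_at_n(results):.3f}")
print(f"pass@4    = {pass_at_n(results):.3f}")
```

Under these definitions, a method can raise average@n by making already-solvable problems more reliable, while pass@n rises only when previously unsolved problems become solvable, which is why the abstract treats pass@n as a measure of the reasoning boundary.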
