2604.20659v1 Apr 22, 2026 cs.LG

GRPO-VPS: Improving Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning (translated title)

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

Yujia Liu (Citations: 0; h-index: 0)

Tengjin Weng (Citations: 21; h-index: 2)

Jierun Chen (Citations: 42; h-index: 3)

Haoli Bai (Citations: 437; h-index: 9)

Lu Hou (Citations: 351; h-index: 6)

Lei Zhu (Citations: 19; h-index: 2)

Haochen Tan (Citations: 35; h-index: 3)

Chaofan Tao (Citations: 81; h-index: 5)

Lifeng Shang (Citations: 39; h-index: 4)

Jingyi Wang (Citations: 13; h-index: 1)

Xiao-Ping Zhang (Citations: 4; h-index: 1)

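Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning capabilities of large language models (LLMs) by verifying outcomes directly instead of relying on learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) removes the need for a critic model, but it assigns credit to intermediate steps indiscriminately, which limits its ability to identify effective reasoning strategies and leads to overthinking. This work proposes a model-free, verifiable form of process supervision that probes the model's belief in the correct answer throughout its reasoning trajectory. The generation is split into discrete segments, and the conditional probability of the correct answer appended at each segment boundary is tracked, which efficiently yields interpretable segment-wise progress measurements used to refine GRPO's trajectory-level feedback. The approach enables more targeted, sample-efficient policy updates without intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6 points higher accuracy with 13.7% shorter reasoning on math tasks, and up to 2.4 points higher accuracy with 4% shorter reasoning on general-domain tasks, indicating strong generalization.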

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.
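The two quantities named in the abstract, GRPO's group-relative trajectory feedback and the segment-wise progress derived from answer probes, can be illustrated with a minimal Python sketch. It assumes the probe probabilities at each segment boundary have already been obtained from the model; the function names, the `beta` weight, and the additive mixing rule in `segment_advantages` are hypothetical choices for illustration, since the abstract does not state how the paper combines the two signals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each outcome reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_progress(answer_probs):
    """Change in the probed probability of the correct answer over each segment.

    answer_probs[0] is the probe before any reasoning segment;
    answer_probs[k] is the probe at the boundary that ends segment k.
    """
    return [after - before for before, after in zip(answer_probs, answer_probs[1:])]


def segment_advantages(trajectory_advantage, answer_probs, beta=0.5):
    """Hypothetical mixing rule (an assumption, not the paper's formula):
    shift the trajectory-level advantage by beta times each segment's progress."""
    return [trajectory_advantage + beta * d for d in segment_progress(answer_probs)]


# Toy group of four sampled responses with verifiable 0/1 outcome rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_relative_advantages(rewards)

# Toy probe values for the first response: one initial probe plus four segments.
probes = [0.10, 0.15, 0.60, 0.55, 0.90]
progress = segment_progress(probes)
per_segment = segment_advantages(advs[0], probes)

print(progress)
print(per_segment)
```

In this toy run the third segment lowers the probed probability, so it receives less credit than its neighbours even though the response as a whole is correct. That is the kind of step-level distinction that a single trajectory-level advantage cannot express.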

1 Citations

0 Influential

4.5 Altmetric

23.5 Score
