2603.10848v1 Mar 11, 2026 cs.LG

V₀.₅: Using a Generalist Value Model as a Prior for Sparse Reinforcement Learning Rollouts

$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Xunliang Cai

Citations: 74

h-index: 5

Hongyan Hao

Citations: 115

h-index: 5

Yi-Kai Zhang

Citations: 35

h-index: 3

Yueqing Sun

Citations: 32

h-index: 4

Qi Gu

Citations: 82

h-index: 5

De-Chuan Zhan

Citations: 648

h-index: 13

Han-Jia Ye

Citations: 287

h-index: 6

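In reinforcement learning with verifiable rewards (RLVR), constructing a robust advantage baseline is critical for policy gradient methods, because it guides the policy model to reinforce desired behaviors. Recent work has introduced generalist value models (e.g., V₀), which achieve pre-trained value estimation by explicitly encoding the model's capabilities in-context. Such models remove the need to update the value model and the policy model simultaneously. This paper proposes V₀.₅, which adaptively combines the baseline predicted by such a value model (acting as a prior) with the empirical mean obtained from sparse rollouts. The result is a robust baseline that is both computationally efficient and very low in variance. Specifically, the method introduces real-time statistical testing and dynamic budget allocation to balance the high variance caused by sparse sampling against the systematic bias (or hallucination) inherent in the value model's prior. The system runs a hypothesis test in real time to assess the reliability of the prior and dynamically allocates additional rollout budget as needed. This mechanism minimizes the mean squared error (MSE) of the baseline estimator, ensuring stable policy gradients even in extremely sparse settings with a group size of 4. Extensive evaluation on six mathematical reasoning benchmarks shows that V₀.₅ substantially outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of about 10%.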
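The bias-variance trade-off described above can be made concrete with a generic shrinkage estimator. The following is an illustrative derivation under stated assumptions, not the paper's own formulation: rewards are binary with unknown success rate $p$, the prior $v$ is treated as a fixed number with bias $\beta = v - p$, and $\bar{r}_n$ is the mean of $n$ independent rollouts. The symbols $\lambda$, $\beta$, and $\sigma^2$ are introduced here for illustration only.

```latex
\begin{aligned}
\hat{b}_\lambda &= \lambda\, v + (1-\lambda)\, \bar{r}_n,
  \qquad \sigma^2 = p(1-p), \\
\operatorname{MSE}(\lambda)
  &= \underbrace{\lambda^2 \beta^2}_{\text{squared bias from the prior}}
   + \underbrace{(1-\lambda)^2 \frac{\sigma^2}{n}}_{\text{variance from sparse rollouts}}, \\
\lambda^\star &= \frac{\sigma^2/n}{\beta^2 + \sigma^2/n},
  \qquad
\operatorname{MSE}(\lambda^\star)
  = \frac{\beta^2\, \sigma^2/n}{\beta^2 + \sigma^2/n}
  \le \min\!\left(\beta^2,\ \frac{\sigma^2}{n}\right).
\end{aligned}
```

Under these assumptions, the fused baseline is never worse than either the prior alone or the empirical mean alone. Because $\beta$ is unknown in practice, some test of the prior's reliability is needed before trusting it, which is the role the abstract assigns to the hypothesis test.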

Original Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation, which balance the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of roughly 10%.
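The test-then-allocate loop described in the abstract might look roughly like the sketch below. It is a minimal illustration under several assumptions and is not the authors' implementation: rewards are binary, the reliability check is a simple two-sided z-test using the normal approximation (crude at a group size of 4), and the prior is weighted as if it were worth a fixed number of pseudo-rollouts. The names `fused_baseline`, `sample_reward`, `z_crit`, `extra_budget`, and `prior_strength` are all hypothetical.

```python
import math
from typing import Callable, List, Tuple


def fused_baseline(
    prior: float,
    rewards: List[float],
    sample_reward: Callable[[], float],
    z_crit: float = 1.96,         # hypothetical rejection threshold
    extra_budget: int = 12,       # hypothetical number of extra rollouts
    prior_strength: float = 4.0,  # hypothetical pseudo-rollout weight of the prior
) -> Tuple[float, int]:
    """Return (baseline, rollouts_used) for one prompt.

    Illustrative only: tests whether the sparse rollouts contradict the
    value-model prior, and buys more rollouts only when they do.
    """
    rewards = list(rewards)
    n = len(rewards)
    mean = sum(rewards) / n

    # Null hypothesis: the true success rate equals the prior.
    # Clamp the prior so the standard error is never zero.
    p0 = min(max(prior, 1e-3), 1.0 - 1e-3)
    z = abs(mean - p0) / math.sqrt(p0 * (1.0 - p0) / n)

    if z > z_crit:
        # Prior looks unreliable: spend extra budget and rely on the
        # larger-sample empirical mean instead.
        rewards += [sample_reward() for _ in range(extra_budget)]
        return sum(rewards) / len(rewards), len(rewards)

    # Prior not rejected: shrink the empirical mean toward it.
    lam = prior_strength / (prior_strength + n)
    return lam * prior + (1.0 - lam) * mean, n


# Usage: a plausible prior is fused; a contradicted prior triggers more rollouts.
agree = fused_baseline(0.5, [1, 1, 1, 0], sample_reward=lambda: 1.0)
clash = fused_baseline(0.05, [1, 1, 1, 1], sample_reward=lambda: 1.0)
```

In the first call the four rollouts are consistent with the prior, so no extra budget is spent and the baseline sits between the prior and the empirical mean. In the second call the rollouts contradict the prior strongly enough that the sketch discards it and samples more.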

