2604.14142v1 Apr 15, 2026 cs.LG

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Tian Liang

Minzheng Wang

Institute of Automation, Chinese Academy of Sciences

Zi-Yan Liu

Shizhu He

Kang Liu

Bo Liu

Yuqiao Tan

Junbo Zhao

Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
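
The contrast the abstract draws between the two spaces can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's own formulation: the symbols $\pi_\theta$ (policy), $\mathcal{D}$ (prompt set), and a binary verifiable reward $r(x,y) \in \{0,1\}$ are introduced here for illustration, and $\log \pi_\theta(y)$ is assumed to mean the model's likelihood of the response scored without the prompt as context.

```latex
% Standard RLVR: policy gradient on the conditional distribution P(y|x).
\nabla_\theta J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]

% PreRL (assumed form): the same reward signal, applied to the marginal P(y).
% Assumption: log pi_theta(y) = sum_t log pi_theta(y_t | y_{<t}), with no prompt x in the context.
g_{\mathrm{PreRL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y) \,\big]

% NSR-PreRL (assumed form): update only on incorrect samples,
% pushing down their marginal likelihood.
g_{\mathrm{NSR}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[\, -\mathbb{1}[\, r(x, y) = 0 \,]\; \nabla_\theta \log \pi_\theta(y) \,\big]
```

Under these assumptions, the gradient-alignment claim in the abstract amounts to saying that $\nabla_\theta \log \pi_\theta(y)$ and $\nabla_\theta \log \pi_\theta(y \mid x)$ point in similar directions, which is what would let $g_{\mathrm{PreRL}}$ stand in for the standard gradient. The paper itself should be consulted for the exact objectives, reward scaling, and any baseline or clipping terms.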
