2603.21877v1 Mar 23, 2026 cs.LG

P^2O: Joint Policy and Prompt Optimization

Yaojie Lu

Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences

Citations: 3,067

h-index: 25

Hongyu Lin

Citations: 4,172

h-index: 29

Xianpei Han

Citations: 4,428

h-index: 29

Boxi Cao

Institute of Software, Chinese Academy of Sciences

Citations: 956

h-index: 13

Xinyu Lu

iscas.ac.cn

Citations: 185

h-index: 9

Kai Zhang

Citations: 66

h-index: 2

Jingli Yang

Citations: 78

h-index: 2

M. He

Citations: 3

h-index: 1

Le Sun

Citations: 5,097

h-index: 31

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
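
The zero-advantage problem mentioned in the abstract can be illustrated with a minimal sketch. It assumes a group-normalized advantage estimator of the kind used in GRPO-style training; the abstract does not say which estimator P^2O uses, so this is an illustration rather than the authors' method. The function name group_advantages, the eps constant, and the binary rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed estimator for illustration only; not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A "hard sample": all 8 rollouts fail, so every binary outcome reward is 0.
hard = group_advantages([0.0] * 8)

# The same prompt after one rollout succeeds (for example, with a better prompt template).
mixed = group_advantages([0.0] * 7 + [1.0])

print(hard)   # every advantage is zero, so the policy gradient carries no signal
print(mixed)  # the single success gets a positive advantage, the failures negative ones
```

Under this assumption, a group in which every rollout fails contributes nothing to the update, which is why obtaining even one successful trajectory on a hard sample restores a usable learning signal.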

1 Citations

0 Influential

15.5 Altmetric

78.5 Score
