2604.16972v1 Apr 18, 2026 cs.AI

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

Yi Yang (Citations: 715, h-index: 13)

Ying Gao (Citations: 13, h-index: 2)

Yong Hu (Citations: 13, h-index: 2)

Jin-Fei Ding (Citations: 109, h-index: 4)

Zhaokang Liao (Citations: 33, h-index: 2)

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improving the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high training efficiency. However, GRPO-style objectives exhibit two issues on high-accuracy prompts, which comprise mastered prompts (rollout accuracy = 1) and majority-correct prompts (rollout accuracy in (0.5, 1)). For mastered prompts, the group-relative advantages vanish, so the prompt yields no training signal and the policy can drift without constraint, which can cause forgetting. For majority-correct prompts, the induced query weight shrinks as accuracy increases, weakening consolidation from partial correctness to mastery. To alleviate these issues, we propose Mastery-Consolidated Policy Optimization (MCPO), which introduces (i) a hinge-KL regularizer applied exclusively to mastered prompts to bound harmful policy drift between successive gradient steps, and (ii) a weighting mechanism that prioritizes majority-correct prompts to better allocate optimization effort. Extensive experiments across three mathematical benchmarks demonstrate that MCPO consistently improves pass@1 performance. Counter-intuitively, rather than restricting exploration, MCPO also boosts pass@k metrics, indicating that mastery consolidation further catalyzes solution diversity.
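The two issues named in the abstract can be illustrated with a small sketch. It assumes binary (0/1) verifiable rewards and the commonly used GRPO normalization, advantage = (reward - group mean) / group std. The mean absolute advantage is used here only as a rough proxy for the "induced query weight"; the abstract does not define that quantity. The `hinge_kl_penalty` function is likewise a hypothetical form of a hinge-KL term: the KL direction, the threshold `delta`, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def mean_abs_advantage(rewards):
    """Proxy (an assumption) for how strongly a prompt contributes to the update."""
    adv = grpo_advantages(rewards)
    return sum(abs(a) for a in adv) / len(adv)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hinge_kl_penalty(accuracy, p_old, p_new, delta=0.01):
    """Hypothetical hinge-KL: active only on mastered prompts (accuracy == 1)
    and only once the drift KL(p_old || p_new) exceeds the threshold delta."""
    if accuracy < 1.0:
        return 0.0
    return max(0.0, kl_divergence(p_old, p_new) - delta)


if __name__ == "__main__":
    # Mastered prompt: every rollout is correct, so all advantages are zero.
    print(grpo_advantages([1] * 8))

    # Majority-correct prompts: the proxy weight shrinks as accuracy rises.
    for correct in (4, 6, 7):
        rewards = [1] * correct + [0] * (8 - correct)
        print(correct / 8, mean_abs_advantage(rewards))

    # The hinge term stays at zero for small drift and grows beyond delta.
    print(hinge_kl_penalty(1.0, [0.9, 0.1], [0.9, 0.1]))
    print(hinge_kl_penalty(1.0, [0.9, 0.1], [0.5, 0.5]))
```

Under these assumptions the proxy weight equals 2·sqrt(p(1 - p)) for rollout accuracy p, which peaks at p = 0.5 and falls to zero at p = 1. This matches the abstract's description of a shrinking weight on majority-correct prompts and a vanishing signal on mastered ones.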
