2604.02288v1 Apr 02, 2026 cs.LG

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Tat-Seng Chua

Citations: 5

h-index: 1

Haiyun Guo

Citations: 1,641

h-index: 19

Jinqiao Wang

Citations: 14

h-index: 2

Gengsheng Li

Citations: 3

h-index: 1

Junfeng Fang

Citations: 55

h-index: 2

Mingyang Song

Citations: 126

h-index: 7

Dan Zhang

Citations: 37

h-index: 3

Tianyu Yang

Citations: 40

h-index: 4

Mao Zheng

Citations: 122

h-index: 7

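Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. Group Relative Policy Optimization (GRPO) is widely used, but its coarse credit assignment penalizes failed rollouts uniformly and so misses the token-level detail needed for efficient improvement. Self-Distillation Policy Optimization (SDPO) provides denser, more targeted logit-level guidance that enables rapid early improvement, but it often becomes unstable during prolonged training. We find that this late-stage instability stems from two underlying problems: first, self-distillation on samples that are already correct introduces optimization ambiguity; second, the reliability of the self-teacher's signal degrades progressively. To address these problems, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-based reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO also incorporates an entropy-aware dynamic weighting mechanism that suppresses unreliable high-entropy distillation targets and emphasizes confident ones. Evaluated on five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently exceeds the peak performance of both baselines, with a five-benchmark average on Qwen3-8B that is 3.4% higher than GRPO and 6.3% higher than SDPO. At the same time, SRPO yields moderate response lengths and reduces per-step compute cost by up to 17.2%.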

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
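The routing and weighting ideas described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the binary reward convention, the standardized group-relative advantage, and the weight form (one minus normalized entropy) are all assumptions made for illustration, and the names `group_relative_advantages`, `entropy_weight`, and `route_group` are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage (assumed form): standardize rewards
    within the group of rollouts sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def entropy_weight(teacher_probs):
    """Hypothetical entropy-aware weight in [0, 1].

    Returns 1 minus the normalized Shannon entropy of the self-teacher's
    distribution over the vocabulary at one token position, so confident
    (low-entropy) targets get weights near 1 and uncertain ones near 0.
    """
    if len(teacher_probs) < 2:
        return 1.0  # a single-outcome distribution has zero entropy
    entropy = -sum(p * math.log(p) for p in teacher_probs if p > 0)
    return 1.0 - entropy / math.log(len(teacher_probs))


def route_group(rewards, teacher_token_probs):
    """Route each rollout in a group by its verifiable reward.

    Assumption: rewards are binary (1 = correct, 0 = failed).
    Correct rollouts get a ("grpo", advantage) entry; failed rollouts
    get a ("sdpo", per-token weights) entry that would scale a
    logit-level distillation loss in a real training loop.
    """
    advantages = group_relative_advantages(rewards)
    routed = []
    for reward, adv, token_dists in zip(rewards, advantages, teacher_token_probs):
        if reward > 0:
            routed.append(("grpo", adv))
        else:
            routed.append(("sdpo", [entropy_weight(d) for d in token_dists]))
    return routed


# Toy group: four rollouts, two tokens each, vocabulary of size 4.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
rewards = [1, 0, 1, 0]
teacher = [[confident, uncertain]] * 4
routed = route_group(rewards, teacher)
```

In this toy group the first and third rollouts are routed to the reward-based branch, while the second and fourth receive per-token weights; the uniform teacher distribution is weighted to zero, which mirrors the abstract's stated goal of suppressing high-entropy distillation targets.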

1 Citation

0 Influential

9.5 Altmetric

48.5 Score
