2603.04135v1 Mar 04, 2026 cs.LG

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

Haodong Zhu (Citations: 121, h-index: 3)
Yangyang Ren (Citations: 9, h-index: 1)
Yanjing Li, Beihang University (Citations: 917, h-index: 13)
Linlin Yang (Citations: 47, h-index: 5)
Xuhui Liu (Citations: 755, h-index: 11)
Xiantong Zhen (Citations: 7, h-index: 2)
Mingbao Lin (Citations: 6, h-index: 2)
Haiguang Liu (Citations: 36, h-index: 2)
Baochang Zhang (Citations: 162, h-index: 6)

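Group Relative Policy Optimization (GRPO) scales LLM reasoning effectively, but its extensive group-based sampling requirement incurs prohibitive computational cost. Recent selective data utilization methods can reduce this overhead, but they alter the underlying sampling distribution, which can introduce estimation bias and undermine theoretical rigor and convergence behavior. To address this limitation, we propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction. By incorporating mathematically derived rescaling factors, DPPO significantly accelerates GRPO training without changing the optimization objective of the full-batch baseline. In addition, to mitigate the data sparsity caused by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization. Extensive experiments show that DPPO consistently accelerates training across diverse models and benchmarks. For example, on Qwen3-4B trained on MATH, DPPO achieves a 2.37$\times$ training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.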

Original Abstract

Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs due to its extensive group-based sampling requirement. While recent selective data utilization methods can mitigate this overhead, they could induce estimation bias by altering the underlying sampling distribution, compromising theoretical rigor and convergence behavior. To address this limitation, we propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction. By incorporating mathematically derived rescaling factors, DPPO significantly accelerates GRPO training without altering the optimization objective of the full-batch baseline. Furthermore, to mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization. Extensive experiments demonstrate that DPPO consistently accelerates training across diverse models and benchmarks. For instance, on Qwen3-4B trained on MATH, DPPO achieves 2.37$\times$ training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.
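
The abstract does not give DPPO's pruning rule or its rescaling factors, so the following is only a generic sketch of why an importance-style correction can keep a pruned estimate unbiased. It assumes each sample is kept independently with a known probability and that survivors are rescaled by the inverse of that probability (a Horvitz-Thompson style weight). The names `pruned_estimate`, `exact_expectation`, `values`, and `keep_probs` are hypothetical, and the scalar `values` merely stand in for per-sample gradient contributions.

```python
import itertools
import random


def pruned_estimate(values, keep_probs, rng):
    """Keep sample i with probability keep_probs[i] and rescale survivors by 1/p_i.

    Assumes every keep probability is strictly positive.
    The result estimates the full-batch mean of `values`.
    """
    total = 0.0
    for v, p in zip(values, keep_probs):
        if rng.random() < p:   # sample survives pruning
            total += v / p     # inverse-probability rescaling
    return total / len(values)


def exact_expectation(values, keep_probs):
    """Expected value of the pruned estimate, by enumerating every keep/drop mask."""
    n = len(values)
    expected = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        prob = 1.0
        estimate = 0.0
        for keep, v, p in zip(mask, values, keep_probs):
            prob *= p if keep else (1.0 - p)
            if keep:
                estimate += v / p
        expected += prob * estimate / n
    return expected


# Toy stand-ins for per-sample contributions and hypothetical keep probabilities.
values = [0.5, -1.2, 2.0, 0.1]
keep_probs = [0.9, 0.5, 1.0, 0.25]
full_batch_mean = sum(values) / len(values)

if __name__ == "__main__":
    print("full-batch mean:      ", full_batch_mean)
    print("expected pruned value:", exact_expectation(values, keep_probs))
```

Dropping the `1 / p` factor in this sketch would weight each sample by its keep probability and shift the expectation away from the full-batch mean, which is the kind of bias the abstract attributes to uncorrected selective data use.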
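
Dense Prompt Packing is described above only as a window-based greedy strategy, so the sketch below is one possible reading and not the paper's algorithm. It assumes each surviving sequence has a known token length and each batch has a fixed token capacity. To fill a batch, it repeatedly scans the first `window` pending sequences and takes the longest one that still fits. The function names and the parameters `capacity` and `window` are hypothetical.

```python
def pack_greedy_window(lengths, capacity, window):
    """Group sequence indices into batches whose total length stays within `capacity`.

    For each batch, repeatedly look at the first `window` pending sequences
    and take the longest one that still fits in the remaining space.
    """
    pending = list(range(len(lengths)))
    batches = []
    while pending:
        current, used = [], 0
        while True:
            candidates = [i for i in pending[:window] if used + lengths[i] <= capacity]
            if not candidates:
                break
            best = max(candidates, key=lambda i: lengths[i])
            current.append(best)
            used += lengths[best]
            pending.remove(best)
        if not current:
            # Nothing in the window fits even in an empty batch.
            raise ValueError("a sequence is longer than the batch capacity")
        batches.append(current)
    return batches


def token_density(lengths, batches, capacity):
    """Fraction of the allocated token slots that hold real tokens."""
    used = sum(lengths[i] for batch in batches for i in batch)
    return used / (len(batches) * capacity)


if __name__ == "__main__":
    lengths = [6, 3, 5, 4, 2]
    batches = pack_greedy_window(lengths, capacity=10, window=3)
    print(batches, token_density(lengths, batches, capacity=10))
```

On these toy lengths, a window of 3 fills two batches completely, while a window of 1 (taking sequences strictly in order) needs three batches. That gap illustrates why looking a few sequences ahead can raise token density.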

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
