2602.06375v1 Feb 06, 2026 cs.AI

Difficulty-Estimated Policy Optimization (DEPO)

Yu Zhao, Longyue Wang, Weihua Luo, Tianle Liu, Bo Zeng, Yu Liu, Fan Jiang

Abstract

Recent advances in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation on problems that are either too trivial or overly complex. In these cases, the inter-group advantages vanish, which leaves the gradient signal susceptible to noise and jeopardizes convergence stability. Variants such as DAPO attempt to rectify the vanishing gradient, but they do not alleviate the substantial computational overhead of exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to improve the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase, so that computational resources are prioritized for samples with high learning potential. Empirical results show that DEPO reduces rollout costs by up to 2x without compromising model performance. Our approach significantly lowers the computational barrier to training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.
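To make the vanishing-advantage problem concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the commonly described GRPO-style advantage, in which each rollout reward is standardized against the mean and standard deviation of its own group, and it shows that a group whose rollouts are all correct or all wrong yields zero advantages. The abstract does not describe how the Difficulty Estimator works, so `filter_before_rollout`, the `estimate_pass_rate` callable, and the `low`/`high` thresholds are hypothetical stand-ins used only to illustrate filtering before the rollout phase.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own rollout group (GRPO-style, assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_before_rollout(prompts, estimate_pass_rate, low=0.1, high=0.9):
    """Hypothetical pre-rollout filter (not the paper's estimator).

    Keeps only prompts whose estimated pass rate lies in [low, high], i.e.
    prompts that are predicted to be neither trivially easy nor out of reach.
    """
    return [p for p in prompts if low <= estimate_pass_rate(p) <= high]


if __name__ == "__main__":
    # Every rollout solved the problem: all advantages are zero.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
    # Every rollout failed: all advantages are zero again.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # Mixed outcomes: correct rollouts get positive advantages, failures negative.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

    # Toy pass-rate estimates standing in for a learned difficulty estimator.
    toy_estimates = {"easy": 0.98, "medium": 0.5, "hard": 0.01}
    print(filter_before_rollout(list(toy_estimates), toy_estimates.get))
```

In the first two groups every rollout carries a zero advantage and therefore no learning signal, so the compute spent generating them is wasted. Skipping such prompts before rollout is the kind of saving the abstract attributes to DEPO.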
