2601.19280v1 Jan 27, 2026 cs.LG

Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

Haitao Mi (Citations: 2,149; h-index: 20)
Dong Yu (Citations: 1,382; h-index: 14)
Kishan Panaganti (Citations: 396; h-index: 10)
Zhenwen Liang (Citations: 257; h-index: 6)
Wenhao Yu (Citations: 919; h-index: 16)

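Recent gains in the reasoning ability of large language models (LLMs) have been driven mainly by improvements to post-training loss functions and alignment strategies. However, existing reinforcement learning (RL) paradigms such as Group Relative Policy Optimization (GRPO) are still constrained by static uniformity: they sample prompts uniformly and use a fixed number of rollouts per prompt. On heterogeneous, heavy-tailed reasoning data, this causes a structural inefficiency in which compute is wasted on already-solved patterns while hard problems are under-trained. To address this, the paper proposes Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that dynamically adapts the training distribution. An Online Difficulty Classifier partitions prompts into dynamic pass@k difficulty groups. Two independent GDRO games are then proposed for post-training. (1) Prompt-GDRO uses an EMA-debiased (exponential moving average) multiplicative-weights bandit sampler that targets the intensive difficulty margin and upweights persistently hard groups without frequency bias. (2) Rollout-GDRO uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). The authors provide no-regret guarantees for both controllers, along with a variance-proxy analysis that motivates a square-root optimal rollout allocation for Rollout-GDRO. The framework is validated on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of 10.6% and 10.1%, respectively, over the GRPO baseline. Qualitative analysis shows that the adversaries shift resources toward the evolving reasoning frontier, improving the reasoning model's performance.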

Original Abstract

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
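
The abstract mentions a square-root optimal rollout allocation under a fixed mean budget but does not spell out the objective. As a rough illustration only, the sketch below assumes a simple variance proxy: group g is sampled with probability q_g, has per-rollout reward variance var_g, and receives n_g rollouts, and we minimize sum_g q_g * var_g / n_g subject to sum_g q_g * n_g = mean_budget. Under that assumed objective, the Lagrangian stationarity condition gives n_g proportional to sqrt(var_g), and the multiplier plays the role of a shadow price. The function name, the objective, and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def sqrt_rollout_allocation(group_probs, reward_variances, mean_budget):
    """Continuous rollout counts under an ASSUMED variance proxy.

    Minimizes sum_g q_g * var_g / n_g subject to sum_g q_g * n_g == mean_budget.
    Stationarity (var_g / n_g**2 == shadow_price for every group) gives
    n_g = sqrt(var_g) / sqrt(shadow_price), i.e. a square-root rule.
    This is an illustrative objective, not the paper's exact formulation.
    """
    if len(group_probs) != len(reward_variances):
        raise ValueError("group_probs and reward_variances must have equal length")
    if not math.isclose(sum(group_probs), 1.0, abs_tol=1e-9):
        raise ValueError("group_probs must sum to 1")

    sigmas = [math.sqrt(v) for v in reward_variances]
    # Expected standard deviation under the group sampling distribution.
    norm = sum(q * s for q, s in zip(group_probs, sigmas))
    if norm == 0.0:
        # No variance anywhere: fall back to the uniform allocation.
        return [float(mean_budget)] * len(group_probs), 0.0

    # Scale so that the expected rollout count equals mean_budget.
    counts = [mean_budget * s / norm for s in sigmas]
    # Lagrange multiplier of the budget constraint (the "shadow price").
    shadow_price = (norm / mean_budget) ** 2
    return counts, shadow_price


# Hypothetical example: three difficulty groups, mean budget of 8 rollouts.
probs = [0.5, 0.3, 0.2]
variances = [0.04, 0.16, 0.25]
counts, price = sqrt_rollout_allocation(probs, variances, 8)
# counts is approximately [5.0, 10.0, 12.5]; the mean stays at 8 rollouts.
print(counts, price)
```

In practice the counts would need to be rounded to integers and the variances estimated online, both of which this sketch leaves out.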

Citations: 0; Influential: 0; Altmetric: 10; Score: 50.0
