2602.05547v1 Feb 05, 2026 cs.CL

Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Matthieu Zimmer (Citations: 38, h-index: 4)

S. Ramesh (Citations: 200, h-index: 6)

Ilija Bogunovic (Citations: 1,941, h-index: 24)

Xiaotong Ji (Citations: 14, h-index: 2)

Sangwoong Yoon (Citations: 53, h-index: 4)

Zhiyong Wang (Citations: 31, h-index: 3)

H. Ammar (Citations: 821, h-index: 11)

Aurélien Lucchi (Citations: 16,987, h-index: 44)

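GRPO is a reinforcement-learning-based post-training method that is widely used to improve large language models on individual reasoning tasks. Real-world deployment, however, requires reliable performance across diverse tasks. Applying GRPO directly to a multi-task setting often produces imbalanced results: some tasks dominate optimization while others stagnate. Tasks can also differ widely in how often their prompts yield zero advantages (and therefore zero gradients), which distorts each task's effective contribution to the optimization signal. To address these problems, we propose a new Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adjusts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler so that task-wise policy gradients reflect the adjusted weights. In experiments on 3-task and 9-task settings, MT-GRPO consistently outperforms the baselines in worst-task accuracy. In particular, MT-GRPO improves worst-task performance by 16-28% absolute over standard GRPO and by 6% absolute over DAPO, while keeping average accuracy competitive. MT-GRPO also needs 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, which indicates substantially better efficiency in reaching reliable performance across tasks.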

Original Abstract

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
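The abstract names three mechanisms without giving formulas: zero-advantage prompts, task weights that favor the worst task, and a ratio-preserving sampler. The Python sketch below illustrates the general idea of each under stated assumptions and is not the authors' implementation. The group-normalized advantage follows the usual GRPO form; the exponentiated weight update, the largest-remainder quota rule, and the resample-until-informative loop are hypothetical stand-ins chosen for illustration, as are all function names, `step_size`, `max_tries`, and the simulated pass rates.

```python
import math
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward is normalized within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # If every rollout gets the same reward, all advantages are 0 (no gradient).
    return [(r - mu) / (sigma + eps) for r in rewards]


def update_task_weights(weights, accuracies, step_size=1.0):
    """Hypothetical exponentiated update that shifts weight toward low-accuracy tasks."""
    scaled = {t: weights[t] * math.exp(-step_size * accuracies[t]) for t in weights}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}


def task_quotas(weights, batch_size):
    """Turn weights (assumed to sum to 1) into integer per-task prompt counts."""
    raw = {t: w * batch_size for t, w in weights.items()}
    quotas = {t: math.floor(x) for t, x in raw.items()}
    leftover = batch_size - sum(quotas.values())
    # Largest-remainder rounding: hand spare slots to the largest fractional parts.
    for t in sorted(raw, key=lambda t: raw[t] - quotas[t], reverse=True)[:leftover]:
        quotas[t] += 1
    return quotas


def ratio_preserving_batch(weights, draw_rewards, batch_size, max_tries=1000):
    """Fill each task's quota using only prompts whose advantages are non-zero.

    draw_rewards(task) is a hypothetical callback returning the rewards of one
    freshly sampled prompt's rollout group. Uninformative prompts are redrawn,
    so the gradient-carrying prompts keep the target task ratio. If max_tries
    runs out for a task, that task's share of the batch is simply short.
    """
    batch = []
    for task, quota in task_quotas(weights, batch_size).items():
        kept = 0
        for _ in range(max_tries):
            if kept == quota:
                break
            rewards = draw_rewards(task)
            if max(rewards) != min(rewards):
                batch.append((task, group_advantages(rewards)))
                kept += 1
    return batch


def make_simulator(pass_rates, group_size=4, seed=0):
    """Toy reward source: each rollout succeeds with the task's pass rate."""
    rng = random.Random(seed)

    def draw_rewards(task):
        return [1.0 if rng.random() < pass_rates[task] else 0.0
                for _ in range(group_size)]

    return draw_rewards
```

In this toy setup, a task with a pass rate near 0 or 1 produces mostly uniform reward groups, so a naive sampler would give it far fewer gradient-carrying prompts than its weight suggests. Redrawing until each quota is met is one simple way to keep the effective ratio aligned with the weights, at the cost of extra rollouts.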

