2603.01106v1 Mar 01, 2026 cs.AI

DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

Xueqi Cheng (Citations: 1,787; h-index: 21)
Liang Pang (Citations: 700; h-index: 15)
Hao Gao (Citations: 147; h-index: 7)
Fangda Guo (Citations: 142; h-index: 8)
Hongjian Dou (Citations: 1; h-index: 1)
Guannan Lv (Citations: 0; h-index: 0)
Tingting Gao (Citations: 5; h-index: 1)
Huawei Shen (Citations: 355; h-index: 11)
Zhenyu Zhang (Citations: 5; h-index: 1)
Shaoguo Liu (Citations: 29; h-index: 2)

Original Abstract

Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
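The "advantage vanishing" issue named in the abstract can be illustrated with the commonly used GRPO-style group normalization, where each response's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below is not the DIVA-GRPO method and is not taken from the linked repository. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage sketch: (reward - group mean) / (group std + eps).

    Assumption: population standard deviation and eps=1e-6 are illustrative
    choices; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields a usable, signed learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every response gets the same reward (problem too hard or
# too easy) yield all-zero advantages, so the policy gradient vanishes.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_wrong)
print(all_right)
```

When every reward in a group is identical, each numerator is zero, so the whole group contributes no gradient. This is why the abstract stresses keeping enough variance in within-group reward distributions.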

Citations: 1

Influential citations: 0

Altmetric: 35.99

Score: 181.0
