2602.05548v2 Feb 05, 2026 cs.LG

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

Zhangquan Chen (Citations: 144, h-index: 7)

Zhiqiu Yu (Citations: 51, h-index: 3)

Mengting Liu (Citations: 7, h-index: 2)

He Zhang (Citations: 28, h-index: 2)

Liangqiong Qu (Citations: 19, h-index: 3)

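Reinforcement learning with verifiable rewards (RLVR), particularly GRPO, has become the standard methodology for eliciting LLM reasoning. However, GRPO's exploration efficiency and its ability to adapt to difficulty remain open challenges. This work argues that these problems stem from a hidden advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry causes two important limitations: (i) at the group level, a strict symmetry between the weights of correct and incorrect trajectories leaves the logits of unsampled actions unchanged, which hinders the exploration of new correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples and loses sensitivity to changes in difficulty. Controlled experiments show that this symmetric property is suboptimal and yield two important insights: (i) asymmetrically suppressing the advantages of correct trajectories promotes essential exploration; (ii) learning efficiency is maximized by prioritizing samples in a curriculum-like fashion, starting with simple samples and gradually shifting to complex ones. Based on these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically adjusts exploration incentives and the focus on sample difficulty. Experiments on seven benchmarks show that A-GRAE consistently improves on GRPO and its variants for both LLMs and MLLMs.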

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, strict symmetry in weights between correct and incorrect trajectories leaves unsampled action logits unchanged, thereby hindering exploration of novel correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples, remaining agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically suppressing the advantages of correct trajectories encourages essential exploration; (ii) learning efficiency is maximized by a curriculum-like transition: prioritizing simpler samples initially before gradually shifting to complex ones. Motivated by these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs.
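
The symmetry described in the abstract can be checked numerically with a small sketch. It assumes the commonly used GRPO advantage, (reward - group mean) / group standard deviation, applied to binary verifiable rewards with a population standard deviation; the abstract itself does not state the formula, so this is an illustration rather than the paper's exact setup. The function names `grpo_advantages`, `signed_mass`, and `scale_positive` are hypothetical, and `scale_positive` with its `beta` parameter is only a stand-in for the idea of asymmetrically suppressing correct-trajectory advantages, not the A-GRAE rule itself.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (std + eps).

    Assumption: population standard deviation over the group.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def signed_mass(advantages):
    """Return (sum of positive advantages, sum of negative advantages)."""
    pos = sum(a for a in advantages if a > 0)
    neg = sum(a for a in advantages if a < 0)
    return pos, neg


def scale_positive(advantages, beta):
    """Hypothetical asymmetric variant: shrink only positive advantages.

    beta in (0, 1] is an illustrative knob, not a parameter from the paper.
    """
    return [a * beta if a > 0 else a for a in advantages]


if __name__ == "__main__":
    group_size = 8
    for num_correct in range(1, group_size):
        # Binary verifiable rewards: 1.0 for correct, 0.0 for incorrect.
        rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
        pos, neg = signed_mass(grpo_advantages(rewards))
        pass_rate = num_correct / group_size
        print(f"pass_rate={pass_rate:.3f}  pos={pos:.3f}  neg={neg:.3f}")
```

Under these assumptions, the positive and negative advantage sums cancel for every pass rate (the group-level symmetry), and their shared magnitude equals G·sqrt(p(1 − p)) for group size G and pass rate p, which peaks at p = 0.5 (the implicit preference for medium-difficulty samples). Passing the advantages through `scale_positive` with `beta < 1` makes the sums no longer cancel, which is the kind of asymmetry the abstract argues for.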

4 Citations

0 Influential

3.5 Altmetric

21.5 Score
