2601.20614v1 Jan 28, 2026 cs.AI

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Yong Wang

Xiangxiang Chu

Yanqi Dai

Xiao Zhang

Zhiwu Lu

Yuxiang Ji

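Reinforcement learning with verifiable rewards (RLVR) provides a powerful mechanism for improving the mathematical reasoning of large models. However, we find that existing methods do not systematically emphasize high-difficulty questions from either the algorithmic or the data perspective, even though such questions are important for refining underdeveloped capabilities. On the algorithmic side, the widely used Group Relative Policy Optimization (GRPO) suffers from an inherent imbalance in which policy updates are smaller in magnitude for harder questions. On the data side, augmentation approaches mostly rephrase questions to gain diversity rather than systematically raising their intrinsic difficulty. To address these problems, we propose the MathForge framework, which improves mathematical reasoning by targeting harder questions from both perspectives. It consists of a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO corrects the inherent imbalance in GRPO through difficulty-balanced group advantage estimation and prioritizes harder questions through difficulty-aware question-level weighting. MQR, meanwhile, reformulates questions along multiple aspects to raise their difficulty while preserving the original gold answer. As a result, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO learns effectively from the augmented data. Extensive experiments show that MathForge substantially outperforms existing methods across a range of mathematical reasoning tasks. The code and augmented data are available at https://github.com/AMAP-ML/MathForge.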

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two-dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions by difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at https://github.com/AMAP-ML/MathForge.
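
The imbalance described in the abstract can be illustrated with a small sketch. The first function below is the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation). With binary correct/incorrect rewards, the summed absolute advantage of a group peaks at 50% accuracy and shrinks as accuracy falls toward zero, which is the hard-question side the abstract refers to. The other two functions are assumptions made for illustration only, not the paper's DGPO: `mad_advantages` is a hypothetical normalization by mean absolute deviation that keeps the summed magnitude constant across difficulty levels, and `difficulty_weights` is a hypothetical softmax weighting (with an invented `temperature` parameter) that gives lower-accuracy questions more weight. The abstract does not state the actual formulas; see the linked repository for those.

```python
import math


def grpo_advantages(rewards):
    """Standard GRPO advantage: (r - group mean) / group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # All responses got the same reward: no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def mad_advantages(rewards):
    """HYPOTHETICAL variant (not confirmed as the paper's method):
    normalize by mean absolute deviation instead of std."""
    n = len(rewards)
    mean = sum(rewards) / n
    mad = sum(abs(r - mean) for r in rewards) / n
    if mad == 0.0:
        return [0.0] * n
    return [(r - mean) / mad for r in rewards]


def total_magnitude(advantages):
    """Summed absolute advantage over one group of responses."""
    return sum(abs(a) for a in advantages)


def difficulty_weights(accuracies, temperature=1.0):
    """HYPOTHETICAL question-level weights: softmax over negative
    accuracy, so harder (lower-accuracy) questions weigh more."""
    scores = [-acc / temperature for acc in accuracies]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


if __name__ == "__main__":
    group_size = 8
    for num_correct in (1, 4, 7):
        rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
        std_total = total_magnitude(grpo_advantages(rewards))
        mad_total = total_magnitude(mad_advantages(rewards))
        print(f"{num_correct}/{group_size} correct: "
              f"std-normalized {std_total:.3f}, MAD-normalized {mad_total:.3f}")
    print(difficulty_weights([0.125, 0.5, 0.875]))
```

With a group of 8 responses, a question answered correctly once yields a smaller summed advantage under the standard-deviation normalization than a question answered correctly four times, while the mean-absolute-deviation variant gives every non-degenerate group the same total. That is one way a "difficulty-balanced" estimate could behave, offered here only as a sketch.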
