2602.09000v1 Feb 09, 2026 cs.AI

iGRPO: Self-Feedback-Driven LLM Reasoning

Igor Gitman (Citations: 2,652; h-index: 16)

Shrimai Prabhumoye (Citations: 500; h-index: 13)

Ximing Lu, University of Washington (Citations: 6,740; h-index: 36)

Yejin Choi (Citations: 691; h-index: 12)

Seungju Han, Stanford (Citations: 1,623; h-index: 18)

W. Ping (Citations: 171; h-index: 2)

Jan Kautz (Citations: 907; h-index: 5)

Ali Hatamizadeh (Citations: 6,901; h-index: 26)

Abstract

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
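The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`group_relative_advantages`, `igrpo_step`), the prompt template used to append the draft, and the group sizes are all hypothetical. The advantage is assumed to take the commonly used GRPO form (reward minus group mean, divided by group standard deviation), which the abstract does not spell out. The policy-gradient update itself is omitted; the sketch stops at the per-sample advantages that such an update would consume.

```python
import random
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Assumption: mean/std normalization, as commonly used for GRPO.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def igrpo_step(
    prompt: str,
    sample: Callable[[str], str],         # hypothetical: draws one completion from the policy
    reward: Callable[[str, str], float],  # hypothetical: scalar reward for (prompt, completion)
    num_drafts: int = 4,
    num_refinements: int = 4,
) -> Tuple[str, List[str], List[float]]:
    # Stage 1: sample exploratory drafts and keep the highest-reward one,
    # scored with the same reward function that is used for optimization.
    drafts = [sample(prompt) for _ in range(num_drafts)]
    best_draft = max(drafts, key=lambda d: reward(prompt, d))

    # Stage 2: append the best draft to the original prompt (template is an
    # assumption) and sample draft-conditioned refinements.
    conditioned = f"{prompt}\n\nPrevious attempt:\n{best_draft}\n\nImprove on it."
    refinements = [sample(conditioned) for _ in range(num_refinements)]

    # Refinements are rewarded against the original task, then normalized
    # within the group. A GRPO-style update would weight each refinement's
    # log-probability by its advantage (not shown here).
    rewards = [reward(prompt, r) for r in refinements]
    advantages = group_relative_advantages(rewards)
    return best_draft, refinements, advantages


if __name__ == "__main__":
    # Toy stand-ins for a policy and a verifier, for illustration only.
    rng = random.Random(0)
    toy_sample = lambda p: str(rng.randint(0, 10))
    toy_reward = lambda p, c: 1.0 if c == "7" else 0.0
    best, refs, advs = igrpo_step("What is 3 + 4?", toy_sample, toy_reward)
    print(best, refs, advs)
```

When every refinement in a group receives the same reward, the normalized advantages are all zero and the group contributes no learning signal, which is one reason the group-relative form depends on reward diversity within the group.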

1 Citation

0 Influential

18 Altmetric

91.0 Score
