2601.06795v3 Jan 11, 2026 cs.AI

GDEPO: Group Dual-dynamic and Equal-right Advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

Xinyan Liu (citations: 0; h-index: 0)
Fan Guo (citations: 23; h-index: 3)
Kang Song (citations: 0; h-index: 0)
Zheng Yan (citations: 8; h-index: 2)
Yi Zhang (citations: 212; h-index: 7)
Junchen Wan (citations: 86; h-index: 4)
Yao Liu (citations: 28; h-index: 3)
Jihao Huang (citations: 29; h-index: 3)
Qi Liu (citations: 764; h-index: 11)
Chen Jia (citations: 51; h-index: 3)

Abstract

Automated Theorem Proving (ATP) is a fundamental challenge in Artificial Intelligence (AI): it requires constructing machine-verifiable proofs in formal languages such as Lean, and it serves as a test of AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach to this task. In ATP scenarios, however, GRPO faces two critical issues. First, when composite rewards are used, its relative advantage estimate may conflict with the binary feedback from the formal verifier. Second, its static sampling strategy may discard an entire batch of data when no valid proof is found, so the batch contributes nothing to model updates and a significant amount of data is wasted. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), which incorporates three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, which decouples the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, which apply extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, and ablation studies validate the necessity of its synergistic components. The proposed method improves data utilization and optimization efficiency, offering a new training paradigm for ATP.
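The three mechanisms named in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the authors' implementation: the composite reward (verifier verdict plus auxiliary reward), the magnitude rule (absolute value of the standardized composite reward), the `max_rounds` cap on resampling, the rule tying extra gradient steps to the number of resampling rounds, and all function names are hypothetical.

```python
import statistics
from typing import Callable, List, Sequence, Tuple

# One sampled group: (verifier verdicts, auxiliary rewards). Hypothetical layout.
Group = Tuple[List[bool], List[float]]


def composite_rewards(verified: Sequence[bool], aux: Sequence[float]) -> List[float]:
    """Assumed composite reward: 1.0 for a verified proof, plus the auxiliary reward."""
    return [float(v) + a for v, a in zip(verified, aux)]


def standardize(rewards: Sequence[float]) -> List[float]:
    """Group-relative standardization: (r - mean) / std, or zeros if std is 0."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def grpo_advantages(verified: Sequence[bool], aux: Sequence[float]) -> List[float]:
    """GRPO-style advantage computed on the composite reward."""
    return standardize(composite_rewards(verified, aux))


def equal_right_advantages(verified: Sequence[bool], aux: Sequence[float]) -> List[float]:
    """Sign comes only from the verifier; magnitude from the composite reward.

    The magnitude rule (|standardized composite reward|) is a placeholder
    assumption, not a formula taken from the paper.
    """
    z = standardize(composite_rewards(verified, aux))
    return [(1.0 if v else -1.0) * abs(s) for v, s in zip(verified, z)]


def sample_until_valid(sample_group: Callable[[], Group],
                       max_rounds: int = 4) -> Tuple[List[bool], List[float], int]:
    """Resample a group until some proof verifies or max_rounds is reached.

    Returns the last group and the number of rounds used. The cap is an
    assumption added so the loop always terminates.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for rounds_used in range(1, max_rounds + 1):
        verified, aux = sample_group()
        if any(verified):
            break
    return verified, aux, rounds_used


def update_steps(rounds_used: int, base_steps: int = 1, extra_per_round: int = 1) -> int:
    """Hypothetical rule: one extra gradient step per additional sampling round."""
    return base_steps + extra_per_round * (rounds_used - 1)


if __name__ == "__main__":
    verified = [True, False, False, False]
    aux = [0.0, 0.9, 0.1, 0.1]
    print("GRPO:       ", grpo_advantages(verified, aux))
    print("equal-right:", equal_right_advantages(verified, aux))
```

In this toy group, the second sample fails verification but carries a high auxiliary reward, so its composite reward lies above the group mean and the GRPO-style advantage for it is positive. That is the kind of conflict with binary verifier feedback the abstract describes. Under the sign/magnitude split, only the verified sample receives a positive advantage.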
