2605.05826v1 May 07, 2026 cs.AI

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Zhengru Fang

Kai Ming Ting

Yimin Deng

Yang Xu

Kun Yao

Ming Pang

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.
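The abstract reports results in terms of pass@$k$ and coverage at large sample sizes. As a point of reference, the sketch below shows the standard unbiased pass@$k$ estimator commonly used in the code-generation and reasoning literature. It is not taken from this paper, and the function name `pass_at_k` and its parameters are illustrative assumptions rather than the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled responses
    c: number of those responses that are correct
    k: sampling budget being evaluated (1 <= k <= n)

    Assumption: this is the widely used combinatorial estimator,
    1 - C(n - c, k) / C(n, k); the paper may compute pass@k differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k responses are incorrect, every k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random k-subset contains no correct response, complemented.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, a problem solved by only 2 of 200 samples has pass@1 of 0.01 but pass@100 of roughly 0.75, which is why rare correct paths matter for coverage at large $k$ even when they barely affect single-sample accuracy.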

