2602.04417v1 Feb 04, 2026 cs.LG

EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL

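Reinforcement learning (RL) helps large language models (LLMs) acquire increasingly complex reasoning and agentic behaviors. This work proposes two simple techniques for improving policy gradient algorithms for LLMs. First, the fixed anchor policy used during RL is replaced with an exponential moving average (EMA), similar to a target network in deep Q-learning. Second, a Top-k KL estimator is introduced that allows flexible interpolation between exact KL and sampled KL. The authors derive stability conditions for the EMA anchor and show that the Top-k KL estimator yields unbiased KL values and unbiased gradients at any k while retaining the benefits of exact KL. Combined with GRPO, the two techniques (EMA-PG) give a significant performance boost. On math reasoning, R1-distilled Qwen-1.5B reaches 53.9% on OlympiadBench, compared with 50.8% for GRPO. On agentic RL domains, with a Qwen-3B base model, EMA-PG improves on GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% to 44.1% on HotpotQA and 27.4% to 40.1% on 2WikiMultiHopQA. Overall, EMA-PG is shown to be a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema-pg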
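To make the EMA anchor idea concrete, here is a minimal sketch of an exponential moving average update over a flat list of parameters. The function name `ema_update`, the `decay` value, and the toy parameter vectors are illustrative assumptions; the abstract does not state the decay the authors use or how the update is scheduled, so this should not be read as their implementation.

```python
def ema_update(anchor, policy, decay):
    """Move each anchor parameter a fraction (1 - decay) toward the policy.

    `anchor` and `policy` are plain lists of floats standing in for model
    weights (a hypothetical simplification for illustration).
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(anchor, policy)]


# Toy example: the anchor trails a policy that is held fixed here.
anchor = [0.0, 0.0]
policy = [1.0, -2.0]
for _ in range(3):
    # decay=0.5 is chosen only to keep the arithmetic easy to follow.
    anchor = ema_update(anchor, policy, decay=0.5)
```

With a decay close to 1 the anchor changes slowly, which is the same role a target network plays in deep Q-learning; with a decay of 0 the anchor would simply copy the current policy at every step.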

Original Abstract

Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, it allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO. On agentic RL domains, with Qwen-3B base, EMA-PG improves GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% $\rightarrow$ 44.1% on HotpotQA, 27.4% $\rightarrow$ 40.1% on 2WikiMultiHopQA. Overall, we show that EMA-PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema-pg
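The interpolation between exact KL and sampled KL can be illustrated with a small sketch. The construction below is one plausible reading of the abstract, not the paper's definition: it sums the KL terms exactly over the k most likely tokens under the policy and adds a single-sample log-ratio term only when the sampled token falls outside that set. The function names and toy distributions are assumptions, and the sketch addresses only the unbiasedness of the KL value, not of its gradient.

```python
import math


def exact_kl(p, q):
    """Exact KL(p || q) over a full vocabulary given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def top_k_kl_estimate(p, q, k, sampled_token):
    """Hypothetical Top-k KL estimate for one token drawn from p.

    Head: exact sum over the k highest-probability tokens under p.
    Tail: the sampled log-ratio, counted only if the sample is outside the head.
    """
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top = set(order[:k])
    head = sum(p[i] * math.log(p[i] / q[i]) for i in top if p[i] > 0.0)
    tail = 0.0
    if sampled_token not in top:
        tail = math.log(p[sampled_token] / q[sampled_token])
    return head + tail


# Toy policy p and anchor q over a 4-token vocabulary (illustrative values).
p = [0.5, 0.3, 0.15, 0.05]
q = [0.25, 0.25, 0.25, 0.25]

# Expected value of the estimate under sampled_token ~ p, for every k.
expected = [
    sum(p[x] * top_k_kl_estimate(p, q, k, x) for x in range(len(p)))
    for k in range(len(p) + 1)
]
```

In this sketch, k = 0 reduces to the plain sampled log-ratio and k equal to the vocabulary size reduces to the exact KL. Every entry of `expected` matches `exact_kl(p, q)` up to floating-point error, which is the value-unbiasedness property the abstract describes.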
