2603.01501v1 Mar 02, 2026 cs.LG

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

Junwei Su

Hao Xu

Yu Tian

Lansong Diao

Zhe Qian

Chuan Wu


Abstract

Asynchronous execution is essential for scaling reinforcement learning (RL) to modern large-model workloads, including large language models and AI agents, but it can fundamentally alter RL optimization behavior. While prior work on asynchronous RL focuses on training throughput and distributional correction, we show that naively applying asynchrony to policy-gradient updates can induce qualitatively different training dynamics and lead to severe training instability. Through systematic empirical and theoretical analysis, we identify a key signature of this instability: asynchronous training exhibits persistently high cosine similarity between consecutive policy gradients, in contrast to the near-orthogonal updates observed under synchronized training. This stale-aligned gradient effect amplifies correlated updates and increases the risk of overshooting and divergence. Motivated by this observation, we propose GRADIENT ALIGNMENT CONTROL (GAC), a simple dynamics-aware stabilization method that regulates asynchronous RL progress along stale-aligned directions via gradient projection. We establish convergence guarantees under bounded staleness and demonstrate empirically that GAC recovers stable, on-policy training dynamics and matches synchronized baselines even at high staleness.
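The abstract names two ingredients: the cosine similarity between consecutive policy gradients, and a gradient projection that limits progress along stale-aligned directions. It does not give the exact update rule, so the following is only a minimal sketch of the general idea, not the paper's algorithm. The function names, the `threshold` and `scale` parameters, and the choice to project onto the single previous gradient are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two flat gradient vectors (lists of floats)."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)


def damp_aligned_component(grad, prev_grad, threshold=0.5, scale=0.0, eps=1e-12):
    """Hypothetical stabilization rule (an assumption, not the paper's GAC).

    If `grad` is strongly aligned with `prev_grad` (cosine above `threshold`),
    shrink the component of `grad` that lies along `prev_grad` by the factor
    `scale` (0.0 removes it entirely, 1.0 leaves `grad` unchanged).
    Otherwise return a copy of `grad` untouched.
    """
    if cosine_similarity(grad, prev_grad) <= threshold:
        return list(grad)
    # Coefficient of the orthogonal projection of grad onto prev_grad.
    coef = dot(grad, prev_grad) / (dot(prev_grad, prev_grad) + eps)
    return [g - (1.0 - scale) * coef * p for g, p in zip(grad, prev_grad)]


if __name__ == "__main__":
    prev_grad = [1.0, 0.0]
    grad = [0.9, 0.1]  # nearly parallel to prev_grad
    print("cosine before:", cosine_similarity(grad, prev_grad))
    controlled = damp_aligned_component(grad, prev_grad)
    print("cosine after: ", cosine_similarity(controlled, prev_grad))
```

With `scale=0.0` the controlled gradient is orthogonal to the previous one, which mimics the near-orthogonal consecutive updates the abstract reports for synchronized training. A value between 0 and 1 would only damp the aligned component rather than remove it.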
