2602.07729v2 Feb 07, 2026 cs.LG

Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Dilek Hakkani-Tur

Citations: 703

h-index: 12

Sagnik Mukherjee

University of Illinois Urbana-Champaign

Citations: 316

h-index: 6

Lifan Yuan

Citations: 619

h-index: 6

Pavan Jayasinha

Citations: 4

h-index: 1

Hao Peng

Citations: 1,124

h-index: 7

Original Abstract

Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
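The two quantitative claims in the abstract, optimizer memory overhead and the fraction of parameters updated, can be illustrated with a small sketch. This is not the authors' measurement code: the function names `fraction_updated` and `optimizer_state_values` are hypothetical, the checkpoint is toy data, and the paper's exact criterion for counting a parameter as "updated" (for example, any tolerance or numeric precision) is an assumption here, taken to be exact inequality of stored values. The state counts rely only on the standard definitions: plain SGD without momentum keeps no per-parameter state, while AdamW keeps a first- and a second-moment estimate for every parameter.

```python
from typing import Sequence


def fraction_updated(before: Sequence[float], after: Sequence[float]) -> float:
    """Return the fraction of parameters whose stored value changed."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    if not before:
        raise ValueError("checkpoints must not be empty")
    changed = sum(1 for b, a in zip(before, after) if b != a)
    return changed / len(before)


def optimizer_state_values(num_params: int, optimizer: str) -> int:
    """Count extra values an optimizer stores besides weights and gradients."""
    if optimizer == "sgd":      # plain SGD without momentum: no state
        return 0
    if optimizer == "adamw":    # first- and second-moment estimates
        return 2 * num_params
    raise ValueError(f"unknown optimizer: {optimizer}")


# Toy checkpoint (illustrative only): 10,000 parameters, 2 of which change.
before = [0.0] * 10_000
after = list(before)
after[3] = 0.5
after[42] = -0.25

print(f"updated fraction: {fraction_updated(before, after):.4%}")
print("AdamW state values:", optimizer_state_values(10_000, "adamw"))
print("SGD state values:", optimizer_state_values(10_000, "sgd"))
```

In the toy data, 2 of 10,000 entries differ, which matches the 0.02% scale quoted in the abstract. With real checkpoints the same comparison would run over every weight tensor before and after RL fine-tuning.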

4 Citations

0 Influential

6 Altmetric

34.0 Score
