2602.15322v1 Feb 17, 2026 cs.LG

On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Ming Zhang (Citations: 8,110; h-index: 5)

Taejong Joo (Citations: 1; h-index: 1)

Cheolmin Kim (Citations: 2,724; h-index: 3)

Eugene Ie (Citations: 1; h-index: 1)

Wenhan Xia (Citations: 336; h-index: 5)

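Training large language models (LLMs) relies almost without exception on dense adaptive optimizers equipped with sophisticated preconditioners. This paper challenges that practice by showing that randomly masking parameter updates is highly effective: a masked variant of RMSProp consistently outperforms recent state-of-the-art optimizers. The analysis shows that random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Building on this finding, the paper introduces Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma delivers consistent gains and works as a simple drop-in replacement for adaptive optimizers with negligible computational overhead. Notably, on the 1B-parameter model, Magma achieves over 19% and 9% lower perplexity than Adam and Muon, respectively.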

Original Abstract

Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
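
To make the idea of "randomly masking parameter updates" concrete, here is a minimal sketch of an RMSProp step with a random mask, written in plain Python. It is an illustration only, not the paper's algorithm: the abstract does not say how masks are sampled, at what granularity, or whether the second-moment estimate is updated for masked entries. The per-coordinate Bernoulli mask, the `keep_prob` parameter, the choice to update the running average for every coordinate, and the function name `masked_rmsprop_step` are all assumptions made for this example. Magma's momentum-gradient alignment is not modelled here.

```python
import math
import random


def masked_rmsprop_step(params, grads, sq_avg, lr=1e-3, beta=0.99,
                        eps=1e-8, keep_prob=0.5, rng=random):
    """One RMSProp step in which each coordinate's update is randomly skipped.

    `sq_avg` is modified in place; the updated parameters are returned as a
    new list. The masking scheme is an assumption, not the paper's method.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Standard RMSProp running average of squared gradients.
        # Assumption: it is updated for every coordinate, masked or not.
        sq_avg[i] = beta * sq_avg[i] + (1.0 - beta) * g * g
        update = lr * g / (math.sqrt(sq_avg[i]) + eps)
        # Assumed Bernoulli mask: apply the update with probability keep_prob.
        keep = rng.random() < keep_prob
        new_params.append(p - update if keep else p)
    return new_params


# Toy usage: minimise f(x) = sum(x_i^2), whose gradient is 2 * x_i.
rng = random.Random(0)
params = [1.0, -2.0, 3.0]
sq_avg = [0.0] * len(params)
for _ in range(2000):
    grads = [2.0 * p for p in params]
    params = masked_rmsprop_step(params, grads, sq_avg, lr=1e-2, rng=rng)
final_loss = sum(p * p for p in params)
```

Setting `keep_prob=1.0` recovers an ordinary dense RMSProp step, which makes it easy to compare the masked and unmasked variants on the same toy problem.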
