2604.15416v1 Apr 16, 2026 cs.LG

StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

Rui Pan

Yuxing Liu

UIUC

Dingzhi Yu

Tong Zhang

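Sign-based optimization algorithms such as SignSGD have attracted attention for their strong performance in distributed learning and in training large foundation models. However, SignSGD fails to converge on non-smooth objectives, which are common in modern machine learning because of ReLU, max-pooling, and Mixture-of-Experts. To overcome this fundamental limitation, we propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operation while keeping the update step unbiased. In the (online) convex optimization setting, theoretical analysis shows that StoSignSGD rigorously resolves the convergence problem of SignSGD and achieves the optimal convergence rate. For the harder non-convex, non-smooth setting, we introduce generalized stationarity measures that encompass existing definitions and prove that StoSignSGD improves on the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD shows stability and efficiency across a range of large language model (LLM) training settings. In particular, in low-precision FP8 pretraining, where AdamW fails catastrophically, StoSignSGD remains highly stable and achieves a 1.44x to 2.14x speedup over existing baselines. When fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD also delivers substantial gains over both AdamW and SignSGD. Finally, to analyze what drives this success, we develop a framework that can convert any general optimizer into an unbiased sign-based counterpart. Using this framework, we break down the core components of StoSignSGD and carry out a comprehensive ablation study to empirically validate the algorithmic design choices.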

Original Abstract

Sign-based optimization algorithms, such as SignSGD, have garnered significant attention for their remarkable performance in distributed learning and training large foundation models. Despite their empirical superiority, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth optimization, we introduce generalized stationary measures that encompass prior definitions, proving that StoSignSGD improves upon the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD exhibits robust stability and superior efficiency across diverse large language model (LLM) training regimes. Notably, in low-precision FP8 pretraining -- a setting where AdamW fails catastrophically -- StoSignSGD remains highly stable and yields a remarkable 1.44× to 2.14× speedup relative to established baselines. Furthermore, when fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD delivers substantial performance gains over both AdamW and SignSGD. Finally, to dissect the mechanisms driving its success, we develop a sign conversion framework capable of transforming any general optimizer into its unbiased, sign-based counterpart. Utilizing this framework, we deconstruct the core components of StoSignSGD and present a comprehensive ablation study to empirically validate our algorithmic design choices.
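
The abstract does not spell out how the stochastic sign is constructed, so the following is only a minimal sketch of one standard way to make a sign-valued update unbiased. It is not confirmed to be the operator StoSignSGD uses. It assumes each gradient coordinate is bounded by a known constant; the names `stochastic_sign`, `sto_sign_step`, and `bound` are hypothetical and chosen for illustration. Each coordinate is set to +1 with probability (1 + g_i / bound) / 2 and to -1 otherwise, so that bound times the expected output equals g_i.

```python
import numpy as np

def stochastic_sign(g, bound, rng):
    """Sample s in {-1, +1}^d such that E[bound * s] = g when |g_i| <= bound.

    Illustrative construction only; `bound` is an assumed known bound on |g_i|.
    """
    g = np.asarray(g, dtype=float)
    # Probability of emitting +1 grows linearly with g_i / bound.
    p_plus = 0.5 * (1.0 + np.clip(g / bound, -1.0, 1.0))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

def sto_sign_step(x, g, lr, bound, rng):
    """One descent step along the stochastic sign instead of sign(g)."""
    return x - lr * stochastic_sign(g, bound, rng)

# Empirical check of unbiasedness on a small example gradient.
rng = np.random.default_rng(0)
g = np.array([0.8, -0.2, 0.0])
bound = 1.0
samples = np.stack([stochastic_sign(g, bound, rng) for _ in range(20000)])
mean_estimate = bound * samples.mean(axis=0)  # should be close to g
```

The deterministic `np.sign(g)` would map the first two coordinates to +1 and -1 regardless of their magnitudes. The randomized version keeps the update in {-1, +1} while preserving the magnitude information in expectation, which is the sense of "unbiased" this sketch illustrates.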
