2602.03001v1 Feb 03, 2026 cs.LG

Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent

Irina Rish (Citations: 604, h-index: 7)

H. Shi (Citations: 76, h-index: 3)

Hiroki Naganuma (Citations: 54, h-index: 4)

Shagun Gupta (Citations: 54, h-index: 3)

Youssef Briki (Citations: 0, h-index: 0)

I. Mitliagkas (Citations: 47, h-index: 1)

Parameswaran Raman (Citations: 16, h-index: 3)

Abstract

To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD's Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ($\ell_\infty$) and stochastic spectral descent (specSGD) / Muon ($\mathcal{S}_\infty$). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160 million parameter Llama model.
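For orientation, here is a minimal sketch of the Euclidean baseline that the abstract contrasts against: the standard "simple" gradient noise scale, estimated from the per-rank gradients that a data-parallel job already computes. It is not the paper's non-Euclidean estimator, whose formulas are not given in the abstract. The function names, the flat-list gradient representation, and the equal per-rank batch size are all assumptions made for illustration. It relies on the identity E|G_B|^2 = |G|^2 + tr(Sigma)/B, evaluated at the local batch size and at the global batch size.

```python
def sq_norm(v):
    """Squared Euclidean norm of a flat gradient vector."""
    return sum(x * x for x in v)


def euclidean_gns(rank_grads, local_batch):
    """Estimate the Euclidean 'simple' gradient noise scale.

    rank_grads:  list of flat gradient vectors, one per data-parallel rank,
                 each already averaged over `local_batch` examples (assumed).
    local_batch: number of examples per rank (assumed equal on every rank).

    Returns (signal, noise, gns), where `signal` estimates |G|^2,
    `noise` estimates tr(Sigma), and gns = noise / signal.
    """
    n_ranks = len(rank_grads)
    if n_ranks < 2:
        raise ValueError("need gradients from at least two ranks")

    b_small = local_batch
    b_big = local_batch * n_ranks
    dim = len(rank_grads[0])

    # What an all-reduce (mean) over ranks would produce: the big-batch gradient.
    g_big = [sum(g[i] for g in rank_grads) / n_ranks for i in range(dim)]

    # Mean squared norm of the small-batch gradients, and of the big-batch one.
    small_sq = sum(sq_norm(g) for g in rank_grads) / n_ranks
    big_sq = sq_norm(g_big)

    # Unbiased estimates obtained by solving the two-batch-size identity.
    signal = (b_big * big_sq - b_small * small_sq) / (b_big - b_small)
    noise = (small_sq - big_sq) / (1.0 / b_small - 1.0 / b_big)

    # A single-step signal estimate can be zero or negative; guard the ratio.
    gns = noise / signal if signal > 0 else float("inf")
    return signal, noise, gns


if __name__ == "__main__":
    # Two ranks, four examples each, two-parameter toy model.
    print(euclidean_gns([[3.0, 1.0], [1.0, 3.0]], local_batch=4))
```

Single-step estimates like these are noisy, so in practice `signal` and `noise` are usually smoothed separately over many steps (for example with exponential moving averages) before taking their ratio. The abstract's setting replaces the squared Euclidean norm with quantities tied to the dual norms of $\ell_\infty$ and $\mathcal{S}_\infty$, which this sketch does not attempt.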
