2602.13498v1 Feb 13, 2026 cs.LG

TrasMuon: An Orthogonalized Momentum Optimization Algorithm with Trust-Region-Based Adaptive Scaling

TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers

Peng Cheng (Citations: 415, h-index: 7)

Boxing Chen (Citations: 45, h-index: 2)

Liheng Ma (Citations: 52, h-index: 2)

Yufei Cui (Citations: 51, h-index: 4)

Yingxue Zhang (Citations: 36, h-index: 3)

Ming Jian (Citations: 3, h-index: 1)

Wen Tong (Citations: 114, h-index: 4)

Qingna Li (Citations: 20, h-index: 2)

Jiucheng Zang (Citations: 2, h-index: 1)

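Muon-style optimizers orthogonalize updates with Newton-Schulz (NS) iterations, which often yields a better update geometry than Adam-family methods. However, orthogonalization discards magnitude information, so training becomes sensitive to the step-size hyperparameter and vulnerable to high-energy bursts. To mitigate this, the paper proposes TrasMuon, a Muon variant with trust-region-based adaptive scaling. TrasMuon keeps Muon's near-isometric geometry while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. The authors find that reintroducing adaptive scaling improves optimization efficiency but typically worsens the instability caused by high-energy outliers. TrasMuon addresses this by defining a trust region from relative energy ratios and confining updates to a stable zone. Experiments on vision and language models show that TrasMuon converges faster than the baselines, and experiments run without a warmup stage confirm its stability and robustness.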

Original Abstract

Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (Trust Region Adaptive Scaling Muon). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines. Furthermore, experiments without warmup stages confirm TrasMuon's superior stability and robustness.
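
The abstract names three ingredients (Newton-Schulz orthogonalization, global RMS calibration, and clipping by a relative energy ratio) but gives no formulas. The following is a minimal pure-Python sketch of how such a step could fit together, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the function names `newton_schulz` and `trasmuon_like_step`, the parameters `target_rms`, `tau`, and `beta`, the use of the squared Frobenius norm as "energy", the moving-average baseline, and the classical cubic Newton-Schulz iteration (Muon implementations commonly use a tuned higher-order polynomial instead).

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def energy(a):
    """Squared Frobenius norm; used here as the 'energy' of a matrix (assumption)."""
    return sum(x * x for row in a for x in row)

def newton_schulz(m, steps=30):
    """Approximate the orthogonal polar factor of m with the classical cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    norm = math.sqrt(energy(m)) + 1e-12
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(r1, r2)]
             for r1, r2 in zip(x, xxtx)]
    return x

def trasmuon_like_step(momentum, energy_ema, lr=0.02, target_rms=0.2,
                       tau=4.0, beta=0.9):
    """One hypothetical update for a single weight matrix.

    Returns (update, new_energy_ema). The caller is expected to initialise
    energy_ema, for example with the energy of the first momentum matrix.
    """
    ortho = newton_schulz(momentum)          # direction only, magnitude discarded
    n = len(ortho) * len(ortho[0])
    rms = math.sqrt(energy(ortho) / n) + 1e-12
    scale = lr * target_rms / rms            # (i) global RMS calibration

    e = energy(momentum)
    ratio = e / (energy_ema + 1e-12)         # relative energy ratio
    if ratio > tau:                          # (ii) trust-region clipping
        scale *= math.sqrt(tau / ratio)      # shrink the step on a burst

    new_ema = beta * energy_ema + (1.0 - beta) * e
    update = [[-scale * v for v in row] for row in ortho]
    return update, new_ema
```

In this sketch, a momentum matrix whose energy stays within `tau` times the running baseline produces an update whose RMS is `lr * target_rms`, while a sudden high-energy burst has its step shrunk by `sqrt(tau / ratio)`. The thresholding rule and the baseline update are design choices of the sketch; the paper may define its trust region differently.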

3 Citations

1 Influential

3.5 Altmetric

22.5 Score
