2602.11185v1 Jan 30, 2026 cs.LG

Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy

Ruijun Huang

Citations: 16

h-index: 3

Fang Dong

Citations: 34

h-index: 4

Zhendong Huang

Citations: 138

h-index: 5

Anrui Chen

Citations: 17

h-index: 3

Mengyi Chen

Citations: 25

h-index: 4

Yifeng Yang

Citations: 16

h-index: 2

Mingzhi Dong

Citations: 56

h-index: 5

Yujiang Wang

Citations: 51

h-index: 5

Qin Lv

Citations: 57

h-index: 5

Robert P. Dick

Citations: 73

h-index: 5

Yuan Cheng

Citations: 8

h-index: 2

Li Shang

Citations: 10

h-index: 2

Xin Zhang

Citations: 46

h-index: 4

Hengjie Cao

Citations: 16

h-index: 3

Jinlong Hou

Citations: 45

h-index: 4

Fan Yang

Citations: 39

h-index: 5

T. Lu

Citations: 79

h-index: 5

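Gradient signals that arise during LLM training are highly anisotropic. Linguistic structure concentrates most of the energy into a few dominant spectral directions, while contextual information lives in the long tail. This work confirms that the separation between the dominant part (the spike) and the tail persists throughout training, and shows that the spike strongly influences optimizer statistics even though it covers only about 1.5% of all directions. Through second-moment normalization, the spike shrinks the updates applied to the tail, and it also narrows the range of learning rates that remain stable overall. Based on this analysis, the authors propose Spectra, a new optimizer that suppresses the spike without amplifying the noise-sensitive spectral tail. Spectra tracks the spike subspace with cached, warm-started power iteration and applies low-rank spectral shaping, keeping overhead negligible and optimizer state memory substantially smaller. When training LLaMA3 8B on 50B tokens, Spectra reaches the target loss 30% faster than AdamW, reduces per-step end-to-end overhead by 0.7%, cuts optimizer state memory by 49.25%, and improves average downstream accuracy by 1.62%. Compared to Muon, Spectra is 5.1x faster in optimizer processing time, achieves a lower final loss, and improves average accuracy by 0.66%.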

Original Abstract

Gradient signals in LLM training are highly anisotropic: recurrent linguistic structure concentrates energy into a small set of dominant spectral directions, while context-specific information resides in a long tail. We show that this spike-tail separation persists throughout training, with the spike occupying only about 1.5% of directions yet dominating optimizer statistics. This dominance suppresses tail learning by contracting tail updates through second-moment normalization and tightening the globally stable learning rate bound. Motivated by this analysis, we propose Spectra, a spike-aware optimizer that suppresses the dominant low-rank spike subspace without amplifying the noise-sensitive spectral tail. Spectra tracks the spike subspace via cached, warm-started power iteration and applies low-rank spectral shaping with negligible overhead and substantially reduced optimizer state memory. On LLaMA3 8B trained on 50B tokens, Spectra reaches the same target loss 30% faster than AdamW, reduces per-step end-to-end overhead by 0.7%, cuts optimizer state memory by 49.25%, and improves average downstream accuracy by 1.62%. Compared to Muon, Spectra is 5.1x faster in optimizer processing time, achieves a lower final loss, and improves average accuracy by 0.66%.
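To make the two mechanisms named in the abstract more concrete, here is a minimal NumPy sketch of warm-started power (subspace) iteration followed by a low-rank damping step. It is not the authors' implementation: the abstract does not give the shaping rule, so the function names `track_spike_subspace` and `shape_gradient`, the `damp` factor, the rank `k`, and the synthetic gradient are all assumptions chosen for illustration.

```python
import numpy as np

def track_spike_subspace(grad, basis, n_iter=1):
    """Refine an orthonormal basis of the top-k left singular subspace.

    Warm start: `basis` is the estimate cached from the previous step,
    so a single block power iteration per step can be enough.
    """
    q = basis
    for _ in range(n_iter):
        z = grad @ (grad.T @ q)      # one power step on grad @ grad.T
        q, _ = np.linalg.qr(z)       # re-orthonormalize the k columns
    return q

def shape_gradient(grad, basis, damp=0.1):
    """Hypothetical shaping rule (an assumption, not from the paper):
    scale the spike component by `damp` and leave the tail untouched."""
    spike = basis @ (basis.T @ grad)
    return grad - (1.0 - damp) * spike

rng = np.random.default_rng(0)
m, n, k = 64, 48, 2

# Synthetic gradient: a strong rank-k spike plus a weak noisy tail.
u, _ = np.linalg.qr(rng.standard_normal((m, k)))
v, _ = np.linalg.qr(rng.standard_normal((n, k)))
spike_true = u @ np.diag([50.0, 40.0]) @ v.T

# Random start once, then reuse the cached basis at every step.
basis, _ = np.linalg.qr(rng.standard_normal((m, k)))
for step in range(5):
    grad = spike_true + 0.1 * rng.standard_normal((m, n))
    basis = track_spike_subspace(grad, basis, n_iter=1)
    shaped = shape_gradient(grad, basis)

top_before = np.linalg.svd(grad, compute_uv=False)[0]
top_after = np.linalg.svd(shaped, compute_uv=False)[0]
print(f"largest singular value: {top_before:.2f} -> {top_after:.2f}")
```

Only the `m x k` basis is carried between steps in this sketch, which hints at why a tracked low-rank subspace can be cheap compared with a full decomposition at every step. The memory and speed figures reported in the abstract depend on the actual method, which this sketch does not reproduce.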

1 Citations

1 Influential

2.5 Altmetric

15.5 Score
