2601.22580v1 Jan 30, 2026 cs.CL

SpanNorm: A New Method for Stable Training and Improved Performance in Deep Transformer Models

SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

Peng Pei

Citations: 258

h-index: 10

Bei Li

Citations: 56

h-index: 4

Xin Chen

Citations: 45

h-index: 3

Jingang Wang

Citations: 391

h-index: 10

Xunliang Cai

Citations: 566

h-index: 13

Xinyu Liu

Citations: 17

h-index: 3

Tong Xiao

Citations: 333

h-index: 10

Chao Wang

Citations: 17

h-index: 2

Jiaqi Zhang

Citations: 46

h-index: 3

Yuchun Fan

Citations: 124

h-index: 6

Linkun Lyu

Citations: 11

h-index: 1

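The success of large language models (LLMs) depends on the stable training of deep Transformer architectures. The placement of normalization layers is a key design decision and creates a fundamental trade-off: the "PreNorm" architecture secures training stability at the cost of potential performance degradation in deep models, whereas the "PostNorm" architecture delivers strong performance but suffers from severe training instability. To resolve this dilemma, this work proposes SpanNorm, a new technique that combines the strengths of both approaches. Structurally, SpanNorm builds a clean residual connection across the entire Transformer block to stabilize signal propagation, while using a PostNorm-style computation that normalizes the aggregated output to improve model performance. The authors provide a theoretical analysis showing that SpanNorm, used together with a principled scaling strategy, bounds the variance of the signal throughout the network, which prevents the gradient problems that arise in PostNorm models and mitigates the representation collapse seen in PreNorm. Empirically, SpanNorm consistently outperforms existing normalization schemes in both dense and Mixture-of-Experts (MoE) settings, opening the way to more powerful and stable Transformer architectures.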

Original Abstract

The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture ensures training stability at the cost of potential performance degradation in deep models, while the "PostNorm" architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire Transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
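
The structural contrast described in the abstract can be illustrated with a small NumPy sketch. The PreNorm and PostNorm blocks follow their standard definitions. The `spannorm_block_sketch` function is only one possible reading of the abstract (a single residual connection over the whole block, followed by normalization of the sum) and is not taken from the paper. The helper names `make_sublayer` and `run_stack`, the toy `tanh` sublayers standing in for attention and feed-forward layers, and the way the sublayers are composed inside the block are all assumptions made for illustration. The paper's scaling strategy is not specified in the abstract and is not modeled here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_sublayer(rng, dim):
    """Hypothetical stand-in for an attention or feed-forward sublayer: a random linear map plus tanh."""
    w = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, dim))
    return lambda x: np.tanh(x @ w)

def prenorm_block(x, attn, ffn):
    # Standard PreNorm: normalize the sublayer input, leave the residual stream unnormalized.
    x = x + attn(layer_norm(x))
    return x + ffn(layer_norm(x))

def postnorm_block(x, attn, ffn):
    # Standard PostNorm: normalize after every residual addition.
    x = layer_norm(x + attn(x))
    return layer_norm(x + ffn(x))

def spannorm_block_sketch(x, attn, ffn):
    # ASSUMPTION: one reading of the abstract, not the paper's definition.
    # The block input skips over the entire block body, and the aggregated
    # output (input + body) is normalized once, PostNorm-style.
    body = ffn(attn(x))  # assumed composition of the two sublayers
    return layer_norm(x + body)

def run_stack(block, x, layers):
    """Apply the same block type repeatedly, with a separate (attn, ffn) pair per layer."""
    for attn, ffn in layers:
        x = block(x, attn, ffn)
    return x

rng = np.random.default_rng(0)
dim, seq_len, depth = 64, 8, 12
x0 = rng.normal(size=(seq_len, dim))
layers = [(make_sublayer(rng, dim), make_sublayer(rng, dim)) for _ in range(depth)]

for name, block in [("PreNorm", prenorm_block),
                    ("PostNorm", postnorm_block),
                    ("SpanNorm (sketch)", spannorm_block_sketch)]:
    out = run_stack(block, x0, layers)
    print(f"{name}: mean per-token output variance = {out.var(axis=-1).mean():.3f}")
```

In this toy setting the PreNorm residual stream accumulates variance with depth, while the two variants that normalize the block output keep it near one. This only illustrates the forward-pass structure; it says nothing about gradient behavior or task performance, which the paper addresses through its own analysis and experiments.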

1 Citation

1 Influential

6.5 Altmetric

35.5 Score
