2602.08064v1 Feb 08, 2026 cs.LG

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Tianyu Li

Citations: 70

h-index: 3

Dongchen Han

Citations: 1,233

h-index: 10

Zixuan Cao

Citations: 5

h-index: 2

Haofeng Huang

Citations: 148

h-index: 3

Mengyu Zhou

Citations: 6

h-index: 2

Ming Chen

Citations: 3

h-index: 1

Erchao Zhao

Citations: 7

h-index: 2

Xiaoxi Jiang

Citations: 39

h-index: 3

Guanjun Jiang

Citations: 37

h-index: 3

Gao Huang

Citations: 5

h-index: 2

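Modern Transformers mostly adopt the Pre-Norm paradigm for its optimization stability, giving up the advantages of the Post-Norm architecture, which is less stable but has greater potential. Earlier attempts to combine the two have typically ended in a trade-off between stability and performance. We attribute this to a structural incompatibility within single-stream designs: applying a Post-Norm operation inevitably obstructs the clean identity gradient that Pre-Norm preserves. To resolve this at its root, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. The design decouples the optimization dynamics of the two streams so that every residual block receives combined gradients inherited from both paradigms; one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models show that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.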

Original Abstract

Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, forgoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
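
The claim about the identity gradient can be checked numerically on a toy residual block. The sketch below is not SiameseNorm and is not taken from the paper's code; it only contrasts the two standard single-stream updates, Pre-Norm as x + F(Norm(x)) and Post-Norm as Norm(x + F(x)). The choice of RMS normalization, the linear sublayer `W`, and all function names are assumptions made for illustration.

```python
import math

def rms_norm(x, eps=1e-12):
    # RMS normalization without a learned gain (an assumption; the abstract
    # does not say which normalization is used).
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def matvec(W, x):
    # Toy linear sublayer F(x) = W x, standing in for attention or an MLP.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def pre_norm_block(x, W):
    # Pre-Norm: the residual path bypasses the normalization entirely.
    fx = matvec(W, rms_norm(x))
    return [a + b for a, b in zip(x, fx)]

def post_norm_block(x, W):
    # Post-Norm: the normalization sits on the residual path itself.
    fx = matvec(W, x)
    return rms_norm([a + b for a, b in zip(x, fx)])

def directional_derivative(block, x, d, W, h=1e-6):
    # Central finite difference of block(x) along direction d.
    plus = block([a + h * b for a, b in zip(x, d)], W)
    minus = block([a - h * b for a, b in zip(x, d)], W)
    return [(p - m) / (2 * h) for p, m in zip(plus, minus)]

W = [[0.2, -0.1, 0.0], [0.05, 0.3, -0.2], [0.1, 0.0, 0.15]]
x = [1.0, -2.0, 0.5]

# Perturb the input along its own direction (a pure rescaling of x).
pre_grad = directional_derivative(pre_norm_block, x, x, W)
post_grad = directional_derivative(post_norm_block, x, x, W)

print("Pre-Norm :", pre_grad)   # close to x: the identity path passes it through
print("Post-Norm:", post_grad)  # close to zero: the normalization removes it
```

In this toy setting, a rescaling of the input survives the Pre-Norm block unchanged through the skip connection, while the Post-Norm block cancels it because the normalization is applied after the residual sum. This is one concrete instance of a Post-Norm operation interrupting the identity path; it says nothing about how SiameseNorm itself wires its two streams.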

3 Citations

2 Influential

35.99 Altmetric

186.9 Score
