2604.19147v1 Apr 21, 2026 cs.LG

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

Simon Wu

Citations: 4

h-index: 1

Weijie Zhao

Citations: 92

h-index: 6

Mingquan Liu

Citations: 28

h-index: 2

Bolun Wang

Citations: 133

h-index: 4

Nuobei Xie

Citations: 79

h-index: 5

Rui Zhu

Citations: 3

h-index: 1

Pengyang Zhou

Citations: 79

h-index: 6

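Scaling Transformer models usually means training a larger model from scratch, because existing architectures struggle to expand without discarding learned representations. We identify the main bottleneck in the linear projections of the attention mechanism: they confine feature extraction to fixed-dimensional subspaces, which limits both expressivity and the capacity for incremental growth. To address this, we propose Nexusformer, which replaces the linear $Q/K/V$ projections with a Nexus-Rank layer. The Nexus-Rank layer is a three-stage nonlinear mapping driven by dual activations that operate in progressively higher-dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes through zero-initialized blocks while pretrained knowledge is preserved. In experiments on language modeling and reasoning benchmarks, Nexusformer matched Tokenformer's perplexity while using up to 41.5% less training compute during progressive scaling (240M to 440M). An analysis of the growth dynamics further shows that zero initialization induces a stable convergence trajectory, from which we derive a geometric scaling law that accurately predicts performance across expansion scales.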

Original Abstract

Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism's linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher-dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer's perplexity using up to 41.5% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.
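
The abstract does not give the exact form of the Nexus-Rank layer, so the following is only a minimal sketch of the general idea behind "lossless" growth with zero-initialized blocks. It assumes a three-stage mapping with two GELU activations between the stages. The names `project` and `grow`, the choice of GELU, the weight shapes, and the decision to widen the two hidden stages are all illustrative assumptions, not the authors' implementation. The point it shows: if every new block that feeds pretrained outputs starts at zero, the widened mapping computes the same function as before, while the new rows are free to learn.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gelu(v):
    """Elementwise GELU (tanh approximation)."""
    return [0.5 * t * (1.0 + math.tanh(0.7978845608 * (t + 0.044715 * t ** 3)))
            for t in v]

def project(x, W1, W2, W3):
    """Hypothetical three-stage nonlinear projection with two activations."""
    return matvec(W3, gelu(matvec(W2, gelu(matvec(W1, x)))))

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def grow(W1, W2, W3, extra1, extra2, rng):
    """Widen both hidden stages without changing the computed function."""
    d_in = len(W1[0])
    h1 = len(W1)
    # New rows of W1 produce new first-stage features; they may be random.
    W1g = W1 + rand_matrix(extra1, d_in, rng)
    # Pretrained rows of W2 get zero columns, so they ignore the new features.
    W2g = [row + [0.0] * extra1 for row in W2]
    # New rows of W2 produce new second-stage features; they may be random.
    W2g += rand_matrix(extra2, h1 + extra1, rng)
    # W3 gets zero columns, so the output ignores the new second-stage features.
    W3g = [row + [0.0] * extra2 for row in W3]
    return W1g, W2g, W3g

rng = random.Random(0)
d_in, h1, h2, d_out = 4, 6, 8, 4  # illustrative sizes only
W1 = rand_matrix(h1, d_in, rng)
W2 = rand_matrix(h2, h1, rng)
W3 = rand_matrix(d_out, h2, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

before = project(x, W1, W2, W3)
W1g, W2g, W3g = grow(W1, W2, W3, extra1=3, extra2=5, rng=rng)
after = project(x, W1g, W2g, W3g)

# Largest elementwise change in the output caused by growing the layer.
max_diff = max(abs(a - b) for a, b in zip(before, after))
print(max_diff)
```

The printed difference should be zero up to floating-point rounding. Once training resumes, gradients reach the zero blocks and the added capacity can start contributing, while the starting point matches the pretrained model.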

