2601.21204v2 Jan 29, 2026 cs.CL

Scaling Embeddings Outperforms Scaling Experts in Language Models

Peng Pei

Citations: 152

h-index: 8

Xunliang Cai

Citations: 118

h-index: 7

Fengcun Li

Citations: 71

h-index: 5

Rumei Li

Citations: 73

h-index: 5

Y. Qian

Citations: 92

h-index: 6

Hong Liu

Citations: 261

h-index: 2

Xing Hu

Citations: 232

h-index: 8

Chao Wang

Citations: 9

h-index: 2

Jiaqi Zhang

Citations: 26

h-index: 3

Linkun Lyu

Citations: 3

h-index: 1

Jiaqi Sun

Citations: 34

h-index: 3

Xurui Yang

Citations: 36

h-index: 2

Bo Wang

Citations: 45

h-index: 4

Lingtong Si

Citations: 36

h-index: 2

Yerui Sun

Citations: 88

h-index: 5

Yuchen Xie

Citations: 99

h-index: 5

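Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, but they increasingly show diminishing returns and run into system-level bottlenecks. In this work, we explore embedding scaling as a powerful, orthogonal way to scale sparsity. Through comprehensive analysis and experiments, we identify specific regimes in which embedding scaling reaches a better Pareto frontier than expert scaling. We systematically analyze the architectural factors that determine this efficiency, from parameter budgeting to the interaction with model width and depth. We also apply tailored system optimizations and speculative decoding to turn this sparsity into real inference speedups. Guided by these insights, we trained LongCat-Flash-Lite from scratch, a 68.5B-parameter model with about 3B activated parameters. Although more than 30B of its parameters are allocated to embeddings, LongCat-Flash-Lite outperforms parameter-equivalent MoE baselines and is highly competitive with existing models of comparable scale, particularly in agentic and coding domains.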

Original Abstract

While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
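The parameter budget quoted in the abstract can be turned into rough shares with a few lines of arithmetic. The sketch below is only an illustration: it uses the three figures from the abstract (68.5B total, over 30B in embeddings, about 3B activated), treats "over 30B" as a lower bound and "~3B" as approximate, and the variable and function names are made up for this example rather than taken from the paper.

```python
# Figures quoted in the abstract, in billions of parameters.
TOTAL_B = 68.5       # total parameters
EMBEDDING_B = 30.0   # lower bound: the abstract says "over 30B"
ACTIVATED_B = 3.0    # approximate: the abstract says "~3B activated"


def share(part_b: float, total_b: float) -> float:
    """Return part/total as a percentage."""
    return 100.0 * part_b / total_b


embedding_share = share(EMBEDDING_B, TOTAL_B)
activated_share = share(ACTIVATED_B, TOTAL_B)
non_embedding_b = TOTAL_B - EMBEDDING_B  # upper bound on everything else

print(f"embedding share >= {embedding_share:.1f}% of total parameters")
print(f"activated share ~  {activated_share:.1f}% of total parameters")
print(f"non-embedding   <= {non_embedding_b:.1f}B parameters")
```

Under these assumptions, embeddings account for at least roughly 44% of the parameters, while only about 4% of the model is activated.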

4 Citations

0 Influential

4 Altmetric

24.0 Score
