2604.25578v1 Apr 28, 2026 cs.CL

Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

Yu Zhao

Longyue Wang

Tianqi Shi

Feihu Jiang

Chenyang Lyu

Fan Jiang

Yichao Du

Weihua Luo

Original Abstract

We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing 3 to 14 times more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.
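
The abstract mentions two mechanisms without giving details: upcycling from a dense model and activating only a small fraction of parameters per token. The sketch below illustrates the general idea in plain Python and is not the authors' implementation. It assumes the common form of sparse upcycling (each expert starts as a copy of the dense feed-forward weights) and top-k routing; the function names, the weight layout, and all numbers are hypothetical and do not reflect Marco-MoE's actual configuration.

```python
import copy


def upcycle_dense_ffn(dense_ffn, num_experts):
    """Create num_experts experts, each an independent copy of the dense FFN weights.

    Assumption: this is the generic sparse-upcycling initialization; the
    abstract does not say how Marco-MoE initializes its experts.
    """
    return [copy.deepcopy(dense_ffn) for _ in range(num_experts)]


def top_k_route(router_logits, k):
    """Return the indices of the k experts with the highest router logits."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i],
                   reverse=True)
    return order[:k]


def activated_fraction(shared_params, params_per_expert, num_experts, k):
    """Fraction of all parameters that one token touches.

    shared_params covers everything every token uses (attention, embeddings);
    only k of the num_experts expert blocks are added on top of that.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return active / total


if __name__ == "__main__":
    # Hypothetical toy weights standing in for a dense FFN layer.
    dense_ffn = {"w_in": [[0.1, 0.2]], "w_out": [[0.3]]}
    experts = upcycle_dense_ffn(dense_ffn, num_experts=4)
    chosen = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
    print("experts created:", len(experts))
    print("experts chosen for this token:", chosen)
    # Made-up parameter counts, chosen only to show the arithmetic.
    print("activated fraction:", activated_fraction(100, 50, 64, 2))
```

Under these assumptions, the activated fraction falls as the number of experts grows while k stays fixed, which is the kind of sparsity the abstract's "around 5%" figure describes.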
