2603.10379v1 Mar 11, 2026 cs.LG

Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Junzhuo Li

Citations: 34

h-index: 4

Xuming Hu

Citations: 20

h-index: 3

Changxin Tian

Citations: 161

h-index: 6

Peijie Jiang

Citations: 72

h-index: 4

Zhiqiang Zhang

Citations: 54

h-index: 4

Jia Liu

Citations: 77

h-index: 3

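This paper extends neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. MoE architectures have emerged as an efficient way to scale model capacity, which makes determining the optimal expert-attention compute ratio important. We define $r$ as the fraction of total FLOPs per token allocated to the expert layers versus the attention layers, and explore how this ratio interacts with the total compute budget and model sparsity. Through extensive experiments on GPT-style MoE Transformers, we find empirically that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. The analysis yields an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation. By incorporating this architectural parameter, the work generalizes the Chinchilla scaling law and provides a new framework for tuning MoE models beyond model size and data. The findings offer practical guidance for designing efficient MoE models that optimize performance within a fixed compute budget.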

Original Abstract

This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.
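The abstract states that $r^*$ follows a power law in total compute but does not give the formula or its coefficients. As a rough illustration of what fitting such a relationship involves, the sketch below assumes the simplest form $r^* \approx a \cdot C^{b}$ and recovers $a$ and $b$ by linear regression in log-log space. The function name `fit_power_law`, the functional form, and every number in it are hypothetical placeholders, not values or methods from the paper, and the sparsity dependence mentioned in the abstract is not modeled.

```python
import numpy as np


def fit_power_law(compute, ratio):
    """Fit ratio ~= a * compute**b by least squares in log-log space.

    Taking logs turns the power law into a straight line:
    log(ratio) = log(a) + b * log(compute).
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_r = np.log(np.asarray(ratio, dtype=float))
    # polyfit returns coefficients highest degree first: [slope, intercept]
    b, log_a = np.polyfit(log_c, log_r, deg=1)
    return float(np.exp(log_a)), float(b)


# Synthetic points generated from a known power law.
# These are made-up placeholders, NOT measurements from the paper.
true_a, true_b = 0.5, 0.05
compute = np.array([1e18, 1e19, 1e20, 1e21])  # hypothetical FLOP budgets
ratio = true_a * compute ** true_b            # hypothetical optimal ratios

a, b = fit_power_law(compute, ratio)
print(f"fitted a={a:.4f}, b={b:.4f}")
```

In practice the inputs would be measured pairs of compute budget and best-performing ratio from a sweep, and a separate fit per sparsity level would be one simple way to see how the coefficients shift.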

0 Citations

0 Influential

3 Altmetric

15.0 Score
