2603.11535v1 Mar 12, 2026 cs.AI


Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Lichao Sun (Citations: 460, h-index: 4)

Hanchi Sun (Citations: 784, h-index: 4)

Yonghui Wu (Citations: 8, h-index: 2)

Yixin Liu (Citations: 107, h-index: 3)


Original Abstract

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.
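
The routing rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract states only that each expert keeps an EMA threshold estimated from the global token distribution and that a token goes to an expert when its score exceeds that threshold. The class name `ExpertThresholdRouter`, the parameters `target_fraction` and `decay`, and the choice to estimate the per-batch cutoff from the score that admits roughly a target fraction of tokens are all assumptions made for illustration.

```python
class ExpertThresholdRouter:
    """Illustrative sketch of threshold-based routing (names are hypothetical)."""

    def __init__(self, num_experts, target_fraction, decay=0.99):
        self.thresholds = [0.0] * num_experts
        self.target_fraction = target_fraction  # assumed per-expert load target
        self.decay = decay                      # assumed EMA decay factor
        self.initialized = False

    def update(self, scores):
        """Update the EMA thresholds from a batch of router scores.

        scores[t][e] is the score of token t for expert e. The per-batch
        cutoff is ASSUMED to be the score just below the top-k tokens,
        where k is about target_fraction * number_of_tokens.
        """
        n = len(scores)
        k = max(1, round(self.target_fraction * n))
        for e in range(len(self.thresholds)):
            column = sorted((row[e] for row in scores), reverse=True)
            cutoff = column[min(k, n - 1)]
            if self.initialized:
                self.thresholds[e] = (
                    self.decay * self.thresholds[e] + (1.0 - self.decay) * cutoff
                )
            else:
                self.thresholds[e] = cutoff
        self.initialized = True

    def route(self, token_scores):
        """Return the experts whose threshold this token's score exceeds.

        Only the token's own scores and the stored thresholds are used,
        so the decision does not depend on other tokens in the batch.
        """
        return [
            e for e, (s, th) in enumerate(zip(token_scores, self.thresholds))
            if s > th
        ]


if __name__ == "__main__":
    router = ExpertThresholdRouter(num_experts=2, target_fraction=0.5, decay=0.5)
    router.update([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
    for token in ([0.3, 0.5], [0.1, 0.1], [0.5, 0.9]):
        print(token, "->", router.route(token))
```

In this sketch a token may be sent to zero, one, or several experts, which is the sense in which computation is allocated dynamically, while the thresholds tracking a target fraction is what keeps expert load roughly balanced without an auxiliary loss term.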

