2601.17111v1 Jan 23, 2026 cs.LG

Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts

Xuan-Phi Nguyen (Citations: 877, h-index: 15)

Caiming Xiong (Citations: 1,404, h-index: 15)

Shafiq Joty (Citations: 816, h-index: 18)

Austin Xu (Citations: 395, h-index: 11)

Shrey Pandit, University of Texas at Austin (Citations: 254, h-index: 9)

Abstract

Mixture-of-Experts (MoE) models are typically pre-trained with explicit load-balancing constraints to ensure statistically balanced expert routing. Despite this, we observe that even well-trained MoE models exhibit significantly imbalanced routing. This behavior is arguably natural, and even desirable, because imbalanced routing allows models to concentrate domain-specific knowledge within a subset of experts. Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices, but it relies on a less-discussed assumption of balanced routing. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures on overloaded devices during post-training or inference, where explicit load balancing is often inapplicable. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and the associated expert parameters from overloaded devices to underutilized ones. This ensures that all devices complete their workloads within the minimum collective latency while respecting memory constraints. Across different model scales, LLEP achieves up to 5x speedup and 4x reduction in peak memory usage compared to standard EP. This enables faster and higher-throughput post-training and inference, including a ~1.9x speedup for gpt-oss-120b. We support our method with extensive theoretical analysis and comprehensive empirical evaluations, including ablation studies. These results illuminate key trade-offs and enable a principled framework for hardware-specific hyper-parameter tuning to achieve optimal performance.
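To make the rerouting idea concrete, here is a minimal toy sketch, not the paper's LLEP algorithm. It assumes a per-device token budget (`capacity`, a hypothetical parameter) and greedily spills tokens above that budget to whichever device currently holds the fewest tokens. The function names, the plan format, and the greedy policy are all assumptions made for illustration; the abstract does not specify these details.

```python
from typing import List, Tuple

# (expert, source device, destination device, number of tokens moved)
Move = Tuple[int, int, int, int]


def device_loads(tokens_per_expert: List[int],
                 expert_to_device: List[int],
                 num_devices: int) -> List[int]:
    """Sum the tokens routed to each device under standard EP placement."""
    loads = [0] * num_devices
    for expert, count in enumerate(tokens_per_expert):
        loads[expert_to_device[expert]] += count
    return loads


def least_loaded_reroute(tokens_per_expert: List[int],
                         expert_to_device: List[int],
                         num_devices: int,
                         capacity: int) -> Tuple[List[Move], List[int]]:
    """Toy greedy rebalancing (an assumption, not the published method).

    Tokens above `capacity` on a device are moved, heaviest expert first,
    to the device with the smallest current load. Every (expert, dst) pair
    in the plan implies that expert's weights must also be copied to dst.
    """
    loads = device_loads(tokens_per_expert, expert_to_device, num_devices)
    plan: List[Move] = []
    for src in range(num_devices):
        excess = loads[src] - capacity
        if excess <= 0:
            continue
        local = [e for e in range(len(tokens_per_expert))
                 if expert_to_device[e] == src]
        local.sort(key=lambda e: -tokens_per_expert[e])
        for expert in local:
            movable = tokens_per_expert[expert]
            while excess > 0 and movable > 0:
                dst = min(range(num_devices), key=lambda d: loads[d])
                room = capacity - loads[dst]
                if room <= 0:
                    break  # no device has spare budget left
                n = min(excess, movable, room)
                plan.append((expert, src, dst, n))
                loads[src] -= n
                loads[dst] += n
                excess -= n
                movable -= n
            if excess <= 0:
                break
    return plan, loads


# Example: expert 0 receives 900 of 1000 tokens; 2 experts per device.
tokens = [900, 20, 30, 10, 10, 10, 10, 10]
placement = [0, 0, 1, 1, 2, 2, 3, 3]
budget = -(-sum(tokens) // 4)  # ceil(total / devices) = 250
before = device_loads(tokens, placement, 4)
plan, after = least_loaded_reroute(tokens, placement, 4, budget)
print("loads before:", before)
print("loads after: ", after)
print("plan:", plan)
```

In this toy case the busiest device drops from 920 tokens to 250. Because a step finishes only when the slowest device does, the maximum per-device load is the quantity that matters. A real system would also weigh the cost of copying expert weights against the compute saved, a trade-off this sketch ignores.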

Citations: 1 · Influential: 0 · Altmetric: 9 · Score: 46.0
