2603.06350v1 Mar 06, 2026 cs.DC

MoEless: Efficient MoE LLM Serving via Serverless Computing

Hao Wang

Citations: 838

h-index: 13

Hanfei Yu

Citations: 403

h-index: 11

Bei Ouyang

Citations: 77

h-index: 4

Shwai He

University of Maryland, College Park

Citations: 781

h-index: 13

Ang Li

Citations: 166

h-index: 6

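Large language models (LLMs) have become a key driver of AI progress across domains such as content creation, search and recommendation systems, and AI-based workflows. To reduce extreme training costs and scale up models, Mixture-of-Experts (MoE) serves as a main backbone of modern LLMs and is typically served in a distributed fashion using Expert Parallelism (EP). However, MoE's sparse activation mechanism causes severe expert load imbalance: load concentrates on a few experts while others sit idle, producing stragglers that increase inference latency and raise serving cost. Existing expert load balancing solutions assume static resource configurations on server-based infrastructure, which limits expert scalability and elasticity and leads to either costly real-time expert swapping or degraded generation quality. This paper presents MoEless, the first serverless MoE serving framework, which mitigates expert load imbalance and accelerates inference using serverless experts. MoEless uses lightweight, layer-aware predictors to accurately estimate the expected expert load distribution and identify potential stragglers in advance. It also designs optimized expert scaling and placement strategies that maximize function locality, improve GPU utilization, and balance load across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. In experiments with open-source MoE models and real-world workloads, MoEless reduces inference latency by 43% and inference cost by 84% compared with state-of-the-art solutions.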

Original Abstract

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs and advance model scale, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployments using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, which limits expert scalability and elasticity and results in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.
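To make the load-imbalance and scaling idea concrete, here is a minimal Python sketch. It is not the MoEless algorithm, which the abstract does not specify. It assumes a simplified model in which an expert's tokens split evenly across its replicas, hands out extra replicas greedily to the expert with the highest per-replica load, and places replicas on the least-loaded GPU. The function names `expert_loads`, `scale_replicas`, and `place_replicas` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter


def expert_loads(routed_experts, num_experts):
    """Count how many tokens the router sent to each expert."""
    counts = Counter(routed_experts)
    return [counts.get(e, 0) for e in range(num_experts)]


def scale_replicas(loads, extra_replicas):
    """Give each extra replica to the expert with the highest per-replica load.

    Assumption: an expert's tokens split evenly across its replicas.
    """
    replicas = [1] * len(loads)
    # heapq is a min-heap, so store negated per-replica load.
    heap = [(-float(load), e) for e, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(extra_replicas):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-loads[e] / replicas[e], e))
    return replicas


def place_replicas(loads, replicas, num_gpus):
    """Place the heaviest replicas first, each on the least-loaded GPU.

    GPUs that do not already host the same expert are preferred, since two
    replicas of one expert on the same GPU add no parallelism.
    """
    shards = sorted(
        (
            (loads[e] / replicas[e], e)
            for e in range(len(loads))
            for _ in range(replicas[e])
        ),
        reverse=True,
    )
    gpu_load = [0.0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for load, e in shards:
        g = min(range(num_gpus), key=lambda i: (e in placement[i], gpu_load[i]))
        gpu_load[g] += load
        placement[g].append(e)
    return placement, gpu_load


if __name__ == "__main__":
    loads = [8, 1, 1, 2]  # expert 0 is the straggler
    replicas = scale_replicas(loads, extra_replicas=1)
    placement, gpu_load = place_replicas(loads, replicas, num_gpus=2)
    print(replicas, placement, gpu_load)
```

In this toy case, expert 0 receives the extra replica and the busiest of the two GPUs carries 6 tokens. Placing the same four experts without replication under the same greedy rule would leave one GPU with 8 tokens. A real system would additionally have to predict loads before routing and account for the cost of starting replicas, both of which this sketch ignores.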
