2603.27624v1 Mar 29, 2026 cs.AR

Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling

Yonghao Tan (citations: 32, h-index: 2)
Pingcheng Dong (citations: 199, h-index: 6)
Songchen Ma (citations: 352, h-index: 7)
Hongyi Li (citations: 3, h-index: 1)
Weihao Zhang (citations: 113, h-index: 5)
Yu Liu (citations: 33, h-index: 2)
Lanxin Liu (citations: 1, h-index: 1)
Yuzhong Jiao (citations: 7, h-index: 2)
Xuejiao Liu (citations: 20, h-index: 2)
Luhong Liang (citations: 71, h-index: 5)
Kwang-Ting Cheng (citations: 1, h-index: 1)

Abstract

Mixture-of-Experts (MoE) is a promising approach for edge AI with low-batch inference. Yet on-device deployments often face limited on-chip memory and severe workload imbalance, and the prevalent use of offloading further incurs off-chip memory access bottlenecks. Moreover, MoE sparsity and dynamic gating shift distributed strategies toward much finer granularity and introduce runtime scheduling considerations. Recently, chiplet interconnects with high die-to-die (D2D) bandwidth have created new opportunities for multi-chiplet systems to address workload imbalance and offloading bottlenecks with fine-grained scheduling. In this paper, we propose Fully Sharded Expert Data Parallelism (FSE-DP), a parallelization paradigm specifically architected for low-batch MoE inference on multi-chiplet accelerators. FSE-DP attains adaptive computation-communication overlap and balanced load by orchestrating fine-grained, complementary expert streams along dynamic trajectories across high-bandwidth D2D links. The attendant dataflow complexity is tamed by a minimal, hardware-amenable set of virtualization rules and a lightweight scheduling algorithm. Our approach achieves a 1.22x to 2.00x speedup over state-of-the-art baselines and saves up to 78.8% of on-chip memory.
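
The workload imbalance mentioned in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's method or its FSE-DP scheduler. It assumes a uniformly random top-k gate as a stand-in for a learned router, a static contiguous mapping of experts to chiplets, and illustrative sizes (4 tokens, 64 experts, 4 chiplets, top-2 routing), none of which come from the paper. It only shows how the imbalance can be measured: with so few tokens, some chiplets may receive several token-expert pairs while others receive none.

```python
import random

def route_tokens(num_tokens, num_experts, top_k, rng):
    """Pick top_k distinct experts per token uniformly at random.

    Assumption: a random gate stands in for a learned MoE router.
    """
    return [rng.sample(range(num_experts), top_k) for _ in range(num_tokens)]

def chiplet_loads(assignments, num_experts, num_chiplets):
    """Count token-expert pairs per chiplet under a static, contiguous placement.

    Assumption: num_experts is divisible by num_chiplets, and expert e
    lives on chiplet e // (num_experts // num_chiplets).
    """
    experts_per_chiplet = num_experts // num_chiplets
    loads = [0] * num_chiplets
    for experts in assignments:
        for e in experts:
            loads[e // experts_per_chiplet] += 1
    return loads

def imbalance(loads):
    """Ratio of the busiest chiplet's load to the mean load (1.0 = perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0

# Hypothetical low-batch setting: 4 tokens, 64 experts, top-2 routing, 4 chiplets.
rng = random.Random(0)
assignments = route_tokens(num_tokens=4, num_experts=64, top_k=2, rng=rng)
loads = chiplet_loads(assignments, num_experts=64, num_chiplets=4)
print("per-chiplet load:", loads)
print("imbalance (max / mean):", imbalance(loads))
```

Changing the seed or lowering `num_tokens` shows how strongly the ratio fluctuates from one batch to the next under a static placement. That fluctuation is the condition the abstract describes as motivating runtime, fine-grained scheduling.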

Citations: 0 | Influential: 0 | Altmetric: 3.5 | Score: 17.5