2603.13606v1 Mar 13, 2026 cs.DC

NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

Xiaofan Li
N. Boker
Maayan Sheraizin
Nimrod Admoni
A. Polyakov
Subhadeep Bhattacharya
Fangfei Yu
Kai Sun
Georgios Theodorakis
Peter-Jan Gootzen
Aamir Shafi
Assaf Ravid
S. D. Girolamo
Manjunath Gorentla Venkata
Gil Bloch
James Dinan
A. Gol'dman
Hsin-Chung Yin

Abstract

Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP and Hybrid-EP. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting a Low-Latency (LL) mode for inference decoding and a High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity, with double-buffered communication to overlap the dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes use the Device API for intra- and inter-node communication, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
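
For readers unfamiliar with the two operations named in the abstract, the sketch below simulates the semantics of MoE dispatch and combine in a single Python process. It is not the NCCL EP API: the function names (`dispatch`, `run_experts`, `combine`, `toy_expert`), the data layout, and the block assignment of experts to ranks are all assumptions made for illustration, and nothing here models RDMA, NVLink, double buffering, or the LL and HT modes.

```python
from collections import defaultdict

# Illustrative only: hypothetical helpers, not NCCL EP functions.

def dispatch(tokens, topk_experts, experts_per_rank):
    """Route each token to the ranks that own its selected experts."""
    inbox = defaultdict(list)
    for token_id, (vector, experts) in enumerate(zip(tokens, topk_experts)):
        for expert_id in experts:
            # Assumption: experts are assigned to ranks in contiguous blocks.
            rank = expert_id // experts_per_rank
            inbox[rank].append((token_id, expert_id, vector))
    return inbox

def run_experts(inbox, expert_fn):
    """Apply each expert to the tokens it received, rank by rank."""
    return {
        rank: [(tid, eid, expert_fn(eid, vec)) for tid, eid, vec in items]
        for rank, items in inbox.items()
    }

def combine(outputs, topk_experts, topk_weights, num_tokens, hidden):
    """Return expert outputs to their source tokens as a weighted sum."""
    result = [[0.0] * hidden for _ in range(num_tokens)]
    for items in outputs.values():
        for tid, eid, vec in items:
            weight = topk_weights[tid][topk_experts[tid].index(eid)]
            for i, x in enumerate(vec):
                result[tid][i] += weight * x
    return result

def toy_expert(expert_id, vector):
    """Stand-in for an expert MLP: scale the input by (expert_id + 1)."""
    return [(expert_id + 1) * x for x in vector]

# Three tokens, four experts on two ranks, top-2 routing.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
topk_experts = [[0, 3], [1, 2], [2, 3]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

inbox = dispatch(tokens, topk_experts, experts_per_rank=2)
outputs = run_experts(inbox, toy_expert)
combined = combine(outputs, topk_experts, topk_weights, num_tokens=3, hidden=2)
```

In a real deployment each rank holds only its own tokens and experts, so the `inbox` grouping corresponds to an all-to-all exchange between GPUs and `combine` to the reverse exchange followed by the weighted reduction.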
