2602.00277v1 Jan 30, 2026 cs.DC

Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

Mathew Oldham (Citations: 16,229; h-index: 6)
Min Si (Citations: 16,126; h-index: 5)
Sharan Narang (Citations: 82,679; h-index: 31)
Wenyin Fu (Citations: 33,692; h-index: 9)
Adi Gangidi (Citations: 16,701; h-index: 9)
Feng Tian (Citations: 16,223; h-index: 6)
M. Naumov (Citations: 16,488; h-index: 8)
Omkar Salpekar (Citations: 16,419; h-index: 6)
V. Ivanov (Citations: 16,170; h-index: 5)
Rohan Varma (Citations: 28; h-index: 3)
Kenny Yu (Citations: 34; h-index: 3)
Yang Wang (Citations: 57; h-index: 3)
A. Sharif (Citations: 16; h-index: 3)
Shawn Xu (Citations: 178; h-index: 3)
Shengbao Zheng (Citations: 223; h-index: 4)
Tristan Rice (Citations: 20; h-index: 2)
Ankush Garg (Citations: 31; h-index: 4)
Shangfu Peng (Citations: 20; h-index: 2)
Shreyas Siravara (Citations: 20; h-index: 2)
Rodrigo de Castro (Citations: 21; h-index: 2)
A. Obraztsov (Citations: 9; h-index: 2)
Sergey Edunov (Citations: 191; h-index: 6)
Chunqiang Tang (Citations: 57; h-index: 3)

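Large-scale training systems generally use synchronous training, which requires every GPU to be working correctly at the same time. When we trained at a scale of 100,000 GPUs, synchronous training was inefficient because of frequent failures and long recovery times. To solve this problem, we propose a new training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP treats the data-parallel replica as the unit of fault tolerance. When a failure occurs, only the single data-parallel replica that contains the failed GPU or server is removed from the system and restarted, while the remaining replicas keep training. To implement this idea at scale, FT-HSDP includes the following techniques. 1) A Fault Tolerant All Reduce (FTAR) protocol for exchanging gradients between data-parallel replicas. FTAR uses the CPU to handle complex control logic, such as dynamically adding or removing participants, and uses the GPU to carry out data transfer for the best performance. 2) A non-blocking catch-up protocol that lets a recovering replica rejoin training with minimal delay. Compared with fully synchronous training at the 100,000-GPU scale, FT-HSDP reduces the stall time caused by failure recovery from 10 minutes to 3 minutes, raising effective training time from 44% to 80%. We also show that FT-HSDP's asynchronous recovery causes no meaningful degradation in the accuracy of the resulting model.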

Original Abstract

Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in low efficiency due to frequent failures and long recovery times. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data-parallel replicas as units of fault tolerance. When failures occur, only the single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data-parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks like adding or removing participants dynamically, and relies on the GPU to perform data transfer for best performance. 2) We introduce a non-blocking catch-up protocol, allowing a recovering replica to join training with minimal stall. Compared with fully synchronous training at O(100K) GPUs, FT-HSDP can reduce the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. We further demonstrate that FT-HSDP's asynchronous recovery does not bring any meaningful degradation to the accuracy of the resulting model.
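The core idea in the abstract, treating each data-parallel replica as the unit of fault tolerance, can be illustrated with a toy single-process simulation. This is only a sketch under stated assumptions and not the paper's FTAR or catch-up protocol. The names `Replica`, `fault_tolerant_all_reduce`, `train_step`, and `rejoin` are hypothetical. The sketch further assumes that gradients are averaged over whichever replicas are currently healthy, and that a restarted replica catches up by copying weights from a healthy peer.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One data-parallel replica (hypothetical stand-in for a group of GPUs)."""
    name: str
    weights: list
    healthy: bool = True


def fault_tolerant_all_reduce(replicas, grads):
    """Average gradients over the replicas that are currently healthy.

    `grads` maps replica name -> gradient vector. Failed replicas are skipped,
    so the divisor follows the live membership instead of stalling the step.
    """
    live = [r for r in replicas if r.healthy]
    if not live:
        raise RuntimeError("no healthy replicas left")
    dim = len(live[0].weights)
    return [sum(grads[r.name][i] for r in live) / len(live) for i in range(dim)]


def train_step(replicas, grads, lr=0.1):
    """Apply the averaged gradient to every healthy replica; return the average."""
    avg = fault_tolerant_all_reduce(replicas, grads)
    for r in replicas:
        if r.healthy:
            r.weights = [w - lr * g for w, g in zip(r.weights, avg)]
    return avg


def rejoin(replica, peer):
    """Assumed catch-up: a restarted replica copies a healthy peer's weights."""
    replica.weights = list(peer.weights)
    replica.healthy = True


# Toy run: three replicas, one fails, the other two keep training.
replicas = [Replica(f"r{i}", [1.0, 1.0]) for i in range(3)]
grads = {"r0": [1.0, 2.0], "r1": [3.0, 4.0], "r2": [5.0, 6.0]}

avg_all = train_step(replicas, grads)        # all three replicas contribute
replicas[2].healthy = False                  # r2 fails and is taken offline
avg_survivors = train_step(replicas, grads)  # r0 and r1 continue without r2
rejoin(replicas[2], replicas[0])             # r2 restarts and catches up
```

In this toy run the failed replica never blocks the others: the second step averages over two replicas instead of three. A real system would also have to agree on membership across hosts and overlap the catch-up transfer with ongoing training, which this sketch does not attempt.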

6 Citations

1 Influential

15.5 Altmetric

85.5 Score
