2605.04478v1 May 06, 2026 cs.DC

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Tao Wang (citations: 37, h-index: 2)
Hairui Zhao (citations: 29, h-index: 3)
Haoxu Li (citations: 8, h-index: 2)
Yida Gu (citations: 18, h-index: 3)
Wenjing Huang (citations: 12, h-index: 2)
Xingchen Liu (citations: 5, h-index: 1)
Fakang Wang (citations: 23, h-index: 2)
Jianhao Fu (citations: 9, h-index: 2)
Zhenhang Sun (citations: 6, h-index: 1)
Qianyu Zhang (citations: 47, h-index: 4)
Yang Tian (citations: 2, h-index: 1)
Zedong Liu (citations: 19, h-index: 3)
Yifan Chen (citations: 122, h-index: 7)
Jinwu Yang (citations: 5, h-index: 1)
Yueyuan Zhou (citations: 12, h-index: 2)
Qianqian Zhao (citations: 10, h-index: 2)
Feng-mei Yu (citations: 13, h-index: 2)
Zhan Wang (citations: 17, h-index: 2)
Guangming Tan (citations: 73, h-index: 5)
Dingwen Tao (citations: 78, h-index: 5)

Original Abstract

As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
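
The abstract does not spell out how the analyzer decides which rank is at fault, so the following is only an illustrative sketch of the general idea of rank-level localization, not CCL-D's actual method. It assumes each rank's probe emits a record when the rank enters a collective call. The names `ProbeRecord`, `find_hang_suspects`, and `find_slow_suspects`, the record fields, and the 0.5-second threshold are all hypothetical. Under these assumptions, a rank that never reaches the collective the others have entered is a hang suspect, and a rank that enters much later than the median is a slow suspect.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ProbeRecord:
    """Hypothetical per-rank trace record (not CCL-D's real schema)."""
    rank: int        # GPU rank that emitted the record
    seq: int         # index of the collective call on that rank
    enter_ts: float  # time (seconds) the rank entered the collective


def find_hang_suspects(records, world_size):
    """Return ranks whose latest collective lags behind the furthest rank."""
    last_seq = {rank: -1 for rank in range(world_size)}
    for rec in records:
        last_seq[rec.rank] = max(last_seq[rec.rank], rec.seq)
    front = max(last_seq.values())
    return sorted(rank for rank, seq in last_seq.items() if seq < front)


def find_slow_suspects(records, seq, threshold_s):
    """Return ranks that entered collective `seq` more than `threshold_s`
    seconds after the median entry time of all ranks that entered it."""
    entries = {rec.rank: rec.enter_ts for rec in records if rec.seq == seq}
    if not entries:
        return []
    mid = median(entries.values())
    return sorted(rank for rank, ts in entries.items() if ts - mid > threshold_s)


# Synthetic example: 4 ranks, two collectives.
records = [
    ProbeRecord(rank=0, seq=0, enter_ts=0.00),
    ProbeRecord(rank=1, seq=0, enter_ts=0.01),
    ProbeRecord(rank=2, seq=0, enter_ts=0.02),
    ProbeRecord(rank=3, seq=0, enter_ts=0.90),  # arrives late at collective 0
    ProbeRecord(rank=0, seq=1, enter_ts=1.00),
    ProbeRecord(rank=1, seq=1, enter_ts=1.01),
    ProbeRecord(rank=3, seq=1, enter_ts=1.02),  # rank 2 never enters collective 1
]

slow = find_slow_suspects(records, seq=0, threshold_s=0.5)
hung = find_hang_suspects(records, world_size=4)
print("slow suspects:", slow)  # [3]
print("hang suspects:", hung)  # [2]
```

Comparing against the median rather than the mean keeps a single late rank from shifting the baseline it is measured against. A production system would also need clock synchronization across hosts and per-communicator sequence numbers, both of which this sketch ignores.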

Citations: 1 (influential: 0)

Altmetric: 3.5

Score: 18.5
