2604.24088v1 Apr 27, 2026 cs.DC

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

Dingwen Tao (Citations: 26; h-index: 3)
Hairui Zhao (Citations: 29; h-index: 3)
Wenjing Huang (Citations: 12; h-index: 2)
Man Liu (Citations: 2; h-index: 1)
Xingjian Tian (Citations: 7; h-index: 1)
Bing Lu (Citations: 7; h-index: 1)
Sheng Lyu (Citations: 12; h-index: 2)
Shengquan Yin (Citations: 3; h-index: 1)
Zheng Wei (Citations: 7; h-index: 2)
Guangming Tan (Citations: 24; h-index: 3)
Xingchen Liu (Citations: 5; h-index: 1)

Abstract

Handling communication overhead in large-scale tensor-parallel (TP) training remains a critical challenge because intermediate tensors have dense, near-zero distributions, which exacerbate errors under frequent communication and introduce significant computational overhead during compression. To address this, we propose TACO (Tensor-parallel Adaptive COmmunication compression), a robust FP8-based framework for compressing TP intermediate tensors. First, we combine a data-driven reshaping strategy with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while a Dual-Scale Quantization mechanism ensures numerical stability throughout training. Second, we design a highly fused compression operator that reduces memory traffic and kernel launch overhead, allowing efficient overlap with communication. Finally, we integrate TACO with existing state-of-the-art methods for data and pipeline parallelism to build a compression-enabled 3D-parallel training framework. Detailed experiments on GPT models and a Qwen model demonstrate up to 1.87× end-to-end throughput improvement while maintaining near-lossless accuracy, validating the effectiveness and efficiency of TACO in large-scale training.
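
The abstract names a Hadamard-style transform followed by scaled FP8 quantization but gives no algorithmic detail, so the following is only a minimal, generic sketch of that general idea and not TACO's actual method. It assumes the E4M3 flavor of FP8 (maximum finite value 448), a single per-block scale, and a plain normalized Walsh-Hadamard transform; the function names `round_e4m3`, `fwht`, and `fp8_roundtrip` are hypothetical, and the adaptive scaling, data-driven reshaping, and Dual-Scale Quantization from the paper are not modeled.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the (assumed) E4M3 format


def round_e4m3(x: float) -> float:
    """Round x to the nearest point of a simulated FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)   # saturate instead of overflowing
    _, e = math.frexp(a)            # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)            # -6: smallest normal exponent (subnormals share it)
    step = 2.0 ** (exp - 3)         # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(a / step) * step, FP8_E4M3_MAX)


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; it is its own inverse."""
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [x * s for x in out]


def fp8_roundtrip(block, use_hadamard=True):
    """Quantize a block to simulated FP8 with one scale, then reconstruct it."""
    y = fwht(block) if use_hadamard else list(block)
    amax = max(abs(t) for t in y)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / FP8_E4M3_MAX     # map the largest magnitude onto the FP8 maximum
    deq = [round_e4m3(t / scale) * scale for t in y]
    return fwht(deq) if use_hadamard else deq


if __name__ == "__main__":
    # Hypothetical block: mostly near-zero values plus one large outlier.
    block = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02, 8.0, -0.005]
    for flag in (False, True):
        rec = fp8_roundtrip(block, use_hadamard=flag)
        err = max(abs(a - b) for a, b in zip(block, rec))
        print(f"use_hadamard={flag}: max abs error = {err:.6f}")
```

Running the script prints the reconstruction error with and without the rotation for one hand-picked block, which allows the two round trips to be compared; it says nothing about how TACO behaves on real TP activations or gradients.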
