2604.09107v1 Apr 10, 2026 cs.DC

TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

He Sun (Citations: 113; h-index: 4)
Huaizheng Zhang (Citations: 455; h-index: 13)
B. Zhong (Citations: 494; h-index: 4)
Qixiang Chen (Citations: 36; h-index: 3)
Weidong Zhang (Citations: 6; h-index: 1)
Kaihua Jiang (Citations: 137; h-index: 4)
Andrea C. Arpaci-Dusseau (Citations: 11,167; h-index: 59)
Remzi H. Arpaci-Dusseau (Citations: 10,541; h-index: 57)
Chen Ye (Citations: 2; h-index: 1)
Ming Han (Citations: 4; h-index: 1)
Xiang Li (Citations: 444; h-index: 5)
Xinyi Zhang (Citations: 173; h-index: 2)
Wang Zhang (Citations: 3,422; h-index: 6)
Wencong Xiao (Citations: 268; h-index: 5)

Original Abstract

Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.
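
The reference-tracking idea in the abstract can be illustrated with a toy sketch. This is not TensorHub's API: the `Worker` and `ReferenceRegistry` names, their methods, and the in-memory list copy standing in for an RDMA read are all assumptions made for illustration. The only point it tries to show is that the registry records which workers hold a weight version and serves a read from one of them, without keeping a copy of the weights itself.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare workers by identity, not by field values
class Worker:
    """Hypothetical stand-in for a worker that holds weights on its GPU."""
    name: str
    weights: dict = field(default_factory=dict)  # version -> payload


class ReferenceRegistry:
    """Hypothetical registry: tracks holders of each version, stores no weights."""

    def __init__(self):
        self._holders = defaultdict(list)  # version -> list of Worker

    def publish(self, version, worker):
        """Record that `worker` already holds `version` in place."""
        if version not in worker.weights:
            raise ValueError(f"{worker.name} does not hold version {version}")
        if worker not in self._holders[version]:
            self._holders[version].append(worker)

    def retire(self, version, worker):
        """Drop the reference, e.g. when the worker overwrites or leaves."""
        holders = self._holders.get(version, [])
        if worker in holders:
            holders.remove(worker)

    def fetch(self, version, reader):
        """Serve a read from a current holder; the reader becomes a holder too."""
        holders = self._holders.get(version)
        if not holders:
            raise KeyError(f"no live holder for version {version}")
        source = holders[0]
        # A plain list copy stands in for the real GPU-to-GPU transfer.
        reader.weights[version] = list(source.weights[version])
        self.publish(version, reader)
        return source.name
```

In this sketch, every successful `fetch` adds a new holder, so later readers can still be served after the original publisher retires its copy. The real system's source selection, consistency, and fault-tolerance mechanisms are not modeled here.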

1 Citation

0 Influential

29.5 Altmetric

148.5 Score
