2601.11590v1 Jan 05, 2026 cs.DC

EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend

Huan Lin (Citations: 13,578; h-index: 6)
Weizhe Lin (Citations: 32; h-index: 3)
Fan Bai (Citations: 321; h-index: 3)
Pai Peng (Citations: 22; h-index: 2)
Z. Tang (Citations: 54; h-index: 5)
Zhe Wang (Citations: 4; h-index: 1)
Gong Chen (Citations: 1; h-index: 1)
Xiang Lu (Citations: 322; h-index: 5)
Yinuo Li (Citations: 110; h-index: 5)
Yaoyuan Wang (Citations: 61; h-index: 4)
Xiaosong Li (Citations: 4; h-index: 1)

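With the broad adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. Existing multimodal inference systems, however, typically use a monolithic architecture that tightly couples the Encode, Prefill, and Decode stages on homogeneous hardware and ignores the heterogeneous computational characteristics of each stage, which leads to low resource utilization and limited system throughput. To address this, the paper proposes EPD-Serve, a stage-level disaggregated inference serving system for multimodal models. EPD-Serve separates the inference pipeline into independent Encode, Prefill, and Decode stages, enabling logical isolation and flexible co-located deployment through dynamic orchestration. Leveraging the Ascend interconnect topology, it introduces asynchronous feature prefetching between the Encode and Prefill stages and a hierarchical grouped KV cache transmission mechanism between the Prefill and Decode stages, improving cross-node communication efficiency. EPD-Serve also integrates multi-route scheduling, instance-level load balancing, and multi-stage hardware co-location with spatial multiplexing to better support diverse multimodal workloads. Comprehensive experiments on multimodal understanding models show that, under high concurrency, EPD-Serve improves end-to-end throughput by 57.37% to 69.48% over PD-disaggregated deployment while meeting strict SLO (Service Level Objective) constraints, including TTFT (Time To First Token) below 2000 ms and TPOT (Time Per Output Token) below 50 ms. These results underscore the effectiveness of stage-level disaggregation for optimizing inference systems for large multimodal models.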
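
The SLO thresholds and the throughput comparison quoted above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the `RequestTrace` type, the helper names, and the choice to compute TPOT as the mean gap between output tokens after the first one are all assumptions. Only the two thresholds (2000 ms and 50 ms) come from the abstract.

```python
from dataclasses import dataclass
from typing import List

# Thresholds quoted in the abstract.
TTFT_SLO_MS = 2000.0
TPOT_SLO_MS = 50.0


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (not from the paper)."""
    arrival_ms: float
    token_times_ms: List[float]  # completion time of each output token


def ttft_ms(trace: RequestTrace) -> float:
    """Time To First Token: delay from arrival to the first output token."""
    return trace.token_times_ms[0] - trace.arrival_ms


def tpot_ms(trace: RequestTrace) -> float:
    """Time Per Output Token, assumed here to be the mean inter-token gap."""
    n = len(trace.token_times_ms)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times_ms[-1] - trace.token_times_ms[0]) / (n - 1)


def meets_slo(trace: RequestTrace) -> bool:
    """True when the request satisfies both thresholds."""
    return ttft_ms(trace) < TTFT_SLO_MS and tpot_ms(trace) < TPOT_SLO_MS


def relative_gain_percent(baseline: float, candidate: float) -> float:
    """Throughput improvement of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) * 100.0 / baseline
```

Under this reading, a reported gain of 57.37% means the candidate system sustains roughly 1.57 times the baseline throughput while each counted request still passes `meets_slo`.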

Original Abstract

With the widespread adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. However, existing multimodal inference systems typically employ monolithic architectures that tightly couple the Encode, Prefill, and Decode stages on homogeneous hardware, neglecting the heterogeneous computational characteristics of each stage. This design leads to inefficient resource utilization and limited system throughput. To address these issues, we propose EPD-Serve, a stage-level disaggregated inference serving system for multimodal models. EPD-Serve decouples the inference pipeline into independent Encode, Prefill, and Decode stages, enabling logical isolation and flexible co-located deployment through dynamic orchestration. Leveraging the Ascend interconnect topology, EPD-Serve introduces asynchronous feature prefetching between Encode and Prefill stages and a hierarchical grouped KV cache transmission mechanism between Prefill and Decode stages to improve cross-node communication efficiency. In addition, EPD-Serve incorporates multi-route scheduling, instance-level load balancing, and multi-stage hardware co-location with spatial multiplexing to better support diverse multimodal workloads. Comprehensive experiments on multimodal understanding models demonstrate that, under high-concurrency scenarios, EPD-Serve improves end-to-end throughput by 57.37-69.48% compared to PD-disaggregated deployment, while satisfying strict SLO constraints, including TTFT below 2000 ms and TPOT below 50 ms. These results highlight the effectiveness of stage-level disaggregation for optimizing multimodal large model inference systems.
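
To illustrate what stage-level disaggregation with instance-level load balancing might look like, here is a minimal toy sketch. It is not EPD-Serve's implementation: `StageInstance`, `pick_least_loaded`, `run_stage`, and the pool sizes are hypothetical names and values chosen for the example, string formatting stands in for real model work, and the sketch does not model asynchronous feature prefetching, KV cache transmission, or anything specific to Ascend hardware.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StageInstance:
    """One hypothetical worker serving a single stage (Encode, Prefill, or Decode)."""
    name: str
    in_flight: int = 0
    handled: list = field(default_factory=list)


def pick_least_loaded(pool):
    """Instance-level load balancing: the instance with the fewest in-flight requests."""
    return min(pool, key=lambda inst: inst.in_flight)


async def run_stage(pool, request_id, payload, work):
    """Route one request to an instance in `pool` and run `work` on its payload."""
    inst = pick_least_loaded(pool)
    inst.in_flight += 1
    try:
        await asyncio.sleep(0)  # yield to other requests; stands in for device work
        result = work(payload)
    finally:
        inst.in_flight -= 1
    inst.handled.append(request_id)
    return result


async def serve(request_id, image, prompt, pools):
    """Pass one request through the three disaggregated stages in order."""
    features = await run_stage(pools["encode"], request_id, image,
                               lambda img: f"feat({img})")
    kv_cache = await run_stage(pools["prefill"], request_id, (features, prompt),
                               lambda fp: f"kv({fp[0]},{fp[1]})")
    return await run_stage(pools["decode"], request_id, kv_cache,
                           lambda kv: f"tokens<{kv}>")


def make_pools(n_encode=1, n_prefill=1, n_decode=2):
    """Build independent instance pools, one per stage (sizes are arbitrary)."""
    sizes = {"encode": n_encode, "prefill": n_prefill, "decode": n_decode}
    return {stage: [StageInstance(f"{stage}-{i}") for i in range(n)]
            for stage, n in sizes.items()}


async def main(n_requests=4):
    pools = make_pools()
    results = await asyncio.gather(
        *(serve(i, f"img{i}", f"q{i}", pools) for i in range(n_requests)))
    return pools, results


if __name__ == "__main__":
    pools, results = asyncio.run(main())
    for line in results:
        print(line)
```

Because each stage has its own pool, the number of Encode, Prefill, and Decode instances can be sized independently, which is the property the abstract attributes to disaggregation. A real system would replace the string transforms with model execution and add the cross-stage data transfer that this sketch omits.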
