2601.11652v1 Jan 15, 2026 cs.DC

WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching

Dimitrios Spatharakis (Citations: 562; h-index: 12)

Jiakun Fan (Citations: 43; h-index: 4)

Xiangchen Li (Citations: 21; h-index: 2)

Qingyuan Wang (Citations: 81; h-index: 5)

Saeid Ghafouri (Citations: 67; h-index: 4)

Hans Vandierendonck (Citations: 15; h-index: 1)

Deepu John (Citations: 113; h-index: 4)

Bo Ji (Citations: 76; h-index: 2)

A. Butt (Citations: 51; h-index: 3)

Dimitrios S. Nikolopoulos (Citations: 39; h-index: 3)

Abstract

As Large Language Models (LLMs) become increasingly accessible to end users, an ever-growing number of inference requests are initiated from edge devices and computed on centralized GPU clusters. However, the resulting exponential growth in computation workload is placing significant strain on data centers, while edge devices remain largely underutilized, leading to imbalanced workloads and resource inefficiency across the network. Integrating edge devices into the LLM inference process via speculative decoding helps balance the workload between the edge and the cloud without loss of prediction accuracy. In this paper, we identify and formalize two critical bottlenecks that limit the efficiency and scalability of distributed speculative LLM serving: Wasted Drafting Time and Verification Interference. To address these challenges, we propose WISP, an efficient and SLO-aware distributed LLM inference system that consists of an intelligent speculation controller, a verification time estimator, and a verification batch scheduler. These components collaboratively enhance drafting efficiency and optimize verification request scheduling on the server. Extensive numerical results show that WISP improves system capacity by up to 2.1x and 4.1x, and increases system goodput by up to 1.94x and 3.7x, compared to centralized serving and SLED, respectively.
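For readers unfamiliar with the draft-and-verify loop that the abstract builds on, the following is a minimal sketch of greedy speculative decoding, not WISP's implementation. It assumes greedy (argmax) decoding and represents both models as plain functions; `toy_target`, `toy_draft`, `draft_tokens`, and `verify_greedy` are hypothetical names introduced only for this illustration. The `wasted` count is one simple way to picture drafting work that gets discarded; the paper's formal definition of Wasted Drafting Time may differ.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical large (server) model: always predicts last token + 1 mod 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical small (edge) model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


def draft_tokens(draft_next: NextTokenFn, prefix: List[Token], k: int) -> List[Token]:
    """Let the draft model propose k tokens autoregressively."""
    ctx = list(prefix)
    proposed: List[Token] = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def verify_greedy(
    target_next: NextTokenFn, prefix: List[Token], proposed: List[Token]
) -> Tuple[List[Token], int]:
    """Accept the longest prefix of `proposed` that matches the target's greedy choice.

    Returns (tokens_to_append, wasted), where `wasted` is the number of drafted
    tokens that were discarded. A real server would score every drafted position
    in one batched forward pass; the sequential loop here is only for clarity.
    """
    ctx = list(prefix)
    out: List[Token] = []
    for i, tok in enumerate(proposed):
        expected = target_next(ctx)
        if tok != expected:
            out.append(expected)  # the target's correction replaces the bad draft token
            return out, len(proposed) - i
        out.append(tok)
        ctx.append(tok)
    out.append(target_next(ctx))  # all drafts accepted: the target adds one extra token
    return out, 0


proposed = draft_tokens(toy_draft, [1], k=4)
accepted, wasted = verify_greedy(toy_target, [1], proposed)
print(proposed, accepted, wasted)
```

In this toy run the draft model diverges at its third token, so the last two drafted tokens are thrown away. Because the output always equals what the target model alone would have produced under greedy decoding, the scheme is lossless; the cost of a poor draft is wasted edge compute and a verification request that occupies the server.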
