2603.05800v1 Mar 06, 2026 cs.DC

StreamWise: How to Serve Large-Scale Multi-Modal Generation in Real Time

StreamWise: Serving Multi-Modal Generation in Real-Time at Scale

Haoran Qiu (Citations: 118, h-index: 5)

G. Chaudhry (Citations: 1,384, h-index: 8)

Chaojie Zhang (Citations: 1,193, h-index: 9)

Esha Choukse (Citations: 1,894, h-index: 18)

Rodrigo Fonseca (Citations: 90, h-index: 3)

Ricardo Bianchini (Citations: 1,056, h-index: 9)

Íñigo Goiri (Citations: 487, h-index: 9)

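Advances in multi-modal generative models are opening new possibilities in areas ranging from storytelling to automated media synthesis. However, most current systems generate simple outputs (e.g., an image from a prompt) in batch mode, and even basic results often take several seconds. Serving large-scale multi-modal workflows in real time is costly and complex: diverse models spanning language, audio, image, and video, each with its own resource requirements, must be coordinated efficiently under strict latency and resource constraints. This work addresses these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet strict service-level objectives (SLOs), the authors designed StreamWise, an adaptive, modular serving system that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. The system also leverages heterogeneous hardware to maximize responsiveness and efficiency. For example, it can lower video resolution and allocate more resources to early scenes. The work quantifies the trade-offs among latency, cost, and quality. The cheapest configuration generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than $25. StreamWise enables high-quality real-time streaming with a sub-second startup delay for less than $45.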

Original Abstract

Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than $25. StreamWise enables high-quality real-time streaming with a sub-second startup delay for under $45.
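
The headline figures in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical and not taken from the paper, and the cost-per-minute values are upper bounds derived from the "less than $25" and "under $45" totals, not numbers the authors report.

```python
# Illustrative arithmetic only; helper names are hypothetical, not from the paper.

def slowdown_factor(generation_seconds: float, content_seconds: float) -> float:
    """How many times slower than real time the generation runs."""
    return generation_seconds / content_seconds


def cost_per_content_minute(total_cost_usd: float, content_minutes: float) -> float:
    """Cost in USD for each minute of generated video."""
    return total_cost_usd / content_minutes


VIDEO_MINUTES = 10           # podcast video length stated in the abstract
BATCH_HOURS = 1.4            # cheapest A100 setup, stated in the abstract
BATCH_COST_BOUND_USD = 25    # "less than $25" (upper bound)
STREAM_COST_BOUND_USD = 45   # "under $45" (upper bound)

batch_slowdown = slowdown_factor(BATCH_HOURS * 3600, VIDEO_MINUTES * 60)
batch_cost_bound = cost_per_content_minute(BATCH_COST_BOUND_USD, VIDEO_MINUTES)
stream_cost_bound = cost_per_content_minute(STREAM_COST_BOUND_USD, VIDEO_MINUTES)

print(f"batch slowdown: {batch_slowdown:.1f}x real time")
print(f"batch cost bound: ${batch_cost_bound:.2f} per video minute")
print(f"streaming cost bound: ${stream_cost_bound:.2f} per video minute")
```

By this reading, 1.4 hours is 84 minutes for 10 minutes of content, which matches the stated 8.4x slowdown, and moving from the cheapest batch setup to real-time streaming raises the cost ceiling from $2.50 to $4.50 per minute of video.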
