2605.07985v1 May 08, 2026 cs.DC

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

Joo-Young Kim (Citations: 126, h-index: 6)

Geon-Woo Kim (Citations: 10, h-index: 2)

Anoop Rachakonda (Citations: 0, h-index: 0)

Daehyeok Kim (Citations: 56, h-index: 4)

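Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, because no single choice guarantees the best performance on every workload. Profile-based simulators are the standard tool, but existing ones hardcode their operation set to a specific configuration and re-profile every operation from scratch, which makes exploration very expensive. This cost comes from a missing structural insight: every input dimension of each operation is either fixed by the model configuration or determined by the incoming request. Because many model-configuration values (e.g., head size, layer count) recur across models, the same operation runs under many configurations, so a single sweep over the request-dependent dimensions can serve all of them. This paper proposes Dooly, which exploits this structure to perform configuration-agnostic, redundancy-aware profiling. Dooly runs a single inference pass, labels the origin of each input dimension via taint propagation, and selectively profiles only the operations missing from its latency database. Stateful operations such as attention are isolated by reusing the serving engine's own initialization code, so no manual instrumentation is needed. Dooly builds latency regression models from the database, and these act as a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE (mean absolute percentage error) for TTFT (time to first token) and 8% for TPOT (time per output token), while reducing profiling GPU-hours by 56.4% across 12 models compared with the existing profiling approach.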
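For readers unfamiliar with the accuracy metric, the following is a minimal sketch of how MAPE is conventionally computed between measured and simulated latencies. The function name `mape` and the sample numbers are illustrative assumptions and are not taken from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    `actual` holds measured latencies and `predicted` holds simulated ones.
    Both names are hypothetical; the paper does not publish this helper.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up TTFT values in milliseconds, for illustration only.
measured_ttft = [100.0, 200.0]
simulated_ttft = [95.0, 210.0]
error_percent = mape(measured_ttft, simulated_ttft)
```

Under this definition, "within 5% MAPE" means the simulated latencies deviate from the measured ones by at most 5% on average, relative to the measured value.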

Original Abstract

Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
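The redundancy argument in the abstract can be illustrated with a small sketch: key each operation by its configuration-fixed dimensions, and sweep the request-dependent dimensions only when that key is missing from the latency database. Everything here is an assumption made for illustration, not Dooly's actual code: the names `OpSignature`, `signature`, and `LatencyDB`, the hand-written origin labels (which Dooly derives via taint propagation), and the placeholder `fake_measure` that stands in for a real GPU measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpSignature:
    """Hypothetical database key: operation name plus config-fixed dimensions."""
    name: str
    fixed_dims: tuple


def signature(name, dims, origins):
    """Keep only dimensions labelled 'config'; 'request' dimensions are swept later."""
    fixed = tuple(sorted((d, v) for d, v in dims.items() if origins[d] == "config"))
    return OpSignature(name, fixed)


class LatencyDB:
    """Toy latency database that profiles each signature at most once."""

    def __init__(self):
        self.table = {}    # OpSignature -> {sweep point: latency}
        self.profiled = 0  # number of signatures actually profiled

    def ensure(self, sig, sweep, measure):
        """Profile `sig` over `sweep` only if it is absent; return True if profiled."""
        if sig in self.table:
            return False
        self.table[sig] = {point: measure(sig, point) for point in sweep}
        self.profiled += 1
        return True


def fake_measure(sig, num_tokens):
    """Placeholder for a real timing run; returns a made-up latency."""
    return 0.001 * num_tokens


# Origin labels are written by hand here.
origins = {"head_size": "config", "num_tokens": "request"}
op_model_a = signature("attention", {"head_size": 128, "num_tokens": 512}, origins)
op_model_b = signature("attention", {"head_size": 128, "num_tokens": 64}, origins)

db = LatencyDB()
first = db.ensure(op_model_a, [1, 16, 256], fake_measure)
second = db.ensure(op_model_b, [1, 16, 256], fake_measure)
```

In this toy setup the two models share the same head size, so their attention operations map to one signature and the second call reuses the first sweep instead of profiling again.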
