2601.06288v1 Jan 09, 2026 cs.LG

AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

Yipeng Shen

Citations: 44

h-index: 2

Yuanzhe Li

Citations: 42

h-index: 4

Xiang Lu

Citations: 322

h-index: 5

Yiyi Chen

Citations: 70

h-index: 5

Tianhao Xu

Citations: 35

h-index: 4

Yiming Liu

Citations: 89

h-index: 6

Yijia Zhao

Citations: 68

h-index: 6

Aichen Feng

Citations: 9

h-index: 1

Qinghua Zhou

Citations: 32

h-index: 3

Xumeng Chen

Citations: 13

h-index: 2

Ilya Sherstyuk

Citations: 9

h-index: 1

Haorui Li

Citations: 27

h-index: 4

Rishi Thakkar

Citations: 734

h-index: 13

Benoit Hamm

Citations: 13

h-index: 2

Xue Huang

Citations: 14

h-index: 2

Wen Wu

Citations: 17

h-index: 3

Anish Shanbhag

Citations: 22

h-index: 2

H. Kim

Citations: 44

h-index: 4

Chuan Chen

Citations: 9

h-index: 1

Junjie Lai

Citations: 12

h-index: 2

Xu Zhou

Citations: 85

h-index: 5

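Optimizing large language model (LLM) inference in production systems is becoming increasingly difficult because of dynamic workloads, strict latency and throughput targets, and a rapidly growing configuration space. This complexity covers not only distributed parallelism strategies (tensor, pipeline, and expert parallelism) but also framework-specific runtime parameters, such as whether CUDA graphs are enabled, the fraction of memory available to the KV cache, and the maximum token capacity, all of which strongly affect performance. Modern inference frameworks such as TRT-LLM, vLLM, and SGLang each use their own kernels and execution policies, so manual tuning is both framework-specific and computationally prohibitive. We present AIConfigurator, a unified performance-modeling system that enables fast, framework-agnostic inference configuration search without GPU-based profiling. AIConfigurator combines (1) a methodology that decomposes inference into analytically modelable primitives (GEMM, attention, communication, and memory operations) while capturing framework-specific scheduling dynamics; (2) a calibrated kernel-level performance database for these primitives across a wide range of hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, Llama, Mistral); and (3) an abstraction layer that automatically determines the optimal launch parameters for the target backend and integrates into production-grade orchestration systems. In evaluations on production LLM serving workloads, AIConfigurator finds serving configurations that improve performance by up to 40% for dense models (e.g., Qwen3-32B) and by up to 50% for MoE architectures (e.g., DeepSeek-V3), completing each search within 30 seconds on average. This enables rapid exploration of a broad design space, from cluster topology down to engine-specific flags.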

Original Abstract

Optimizing Large Language Model (LLM) inference in production systems is increasingly difficult due to dynamic workloads, stringent latency/throughput targets, and a rapidly expanding configuration space. This complexity spans not only distributed parallelism strategies (tensor/pipeline/expert) but also intricate framework-specific runtime parameters, such as the enablement of CUDA graphs, the available KV-cache memory fraction, and the maximum token capacity, which drastically impact performance. The diversity of modern inference frameworks (e.g., TRT-LLM, vLLM, SGLang), each employing distinct kernels and execution policies, makes manual tuning both framework-specific and computationally prohibitive. We present AIConfigurator, a unified performance-modeling system that enables rapid, framework-agnostic inference configuration search without requiring GPU-based profiling. AIConfigurator combines (1) a methodology that decomposes inference into analytically modelable primitives (GEMM, attention, communication, and memory operations) while capturing framework-specific scheduling dynamics; (2) a calibrated kernel-level performance database for these primitives across a wide range of hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, Llama, Mistral); and (3) an abstraction layer that automatically resolves optimal launch parameters for the target backend, seamlessly integrating into production-grade orchestration systems. Evaluation on production LLM serving workloads demonstrates that AIConfigurator identifies superior serving configurations that improve performance by up to 40% for dense models (e.g., Qwen3-32B) and up to 50% for MoE architectures (e.g., DeepSeek-V3), while completing searches within 30 seconds on average, enabling the rapid exploration of vast design spaces, from cluster topology down to engine-specific flags.
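The idea of decomposing inference into analytically modelable primitives and then searching the configuration space without touching a GPU can be illustrated with a toy sketch. This is not AIConfigurator's code or API. Every name here (`Config`, `decode_step_time`, `search`) is hypothetical, the hardware constants and model shapes are made-up placeholders, and a simple roofline estimate stands in for the calibrated kernel-level database the abstract describes. Memory-capacity limits and framework-specific scheduling are ignored.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative assumptions only, not values from the paper's database.
PEAK_FLOPS = 400e12   # FLOP/s per GPU (hypothetical)
MEM_BW = 2e12         # bytes/s of GPU memory bandwidth (hypothetical)
LINK_BW = 200e9       # bytes/s of GPU-to-GPU bandwidth (hypothetical)
BYTES_PER_VALUE = 2   # 16-bit weights and activations


@dataclass(frozen=True)
class Config:
    tp: int     # tensor-parallel degree
    batch: int  # number of sequences decoded together


def decode_step_time(cfg, n_params=32e9, hidden=5120, layers=64, ctx=2048):
    """Estimate one decode step as a sum of primitive latencies (seconds)."""
    # GEMM primitive: roofline bound, with weights sharded across tp GPUs.
    flops = 2 * n_params * cfg.batch / cfg.tp
    weight_bytes = n_params * BYTES_PER_VALUE / cfg.tp
    gemm = max(flops / PEAK_FLOPS, weight_bytes / MEM_BW)
    # Attention primitive: decode is modeled as reading the KV cache once.
    kv_bytes = 2 * layers * hidden * ctx * cfg.batch * BYTES_PER_VALUE / cfg.tp
    attn = kv_bytes / MEM_BW
    # Communication primitive: two ring all-reduces per layer when tp > 1.
    comm = 0.0
    if cfg.tp > 1:
        msg_bytes = cfg.batch * hidden * BYTES_PER_VALUE
        ring_factor = 2 * (cfg.tp - 1) / cfg.tp
        comm = 2 * layers * ring_factor * msg_bytes / LINK_BW
    return gemm + attn + comm


def search(max_step_s, tps=(1, 2, 4, 8), batches=(1, 8, 32, 128)):
    """Return (config, tokens/s per GPU, step time) or None if infeasible."""
    best = None
    for tp, batch in product(tps, batches):
        cfg = Config(tp, batch)
        step = decode_step_time(cfg)
        if step > max_step_s:  # violates the per-token latency target
            continue
        per_gpu = batch / step / tp
        if best is None or per_gpu > best[1]:
            best = (cfg, per_gpu, step)
    return best
```

Calling `search(max_step_s=0.05)` walks all sixteen candidate configurations in well under a millisecond, which hints at why a purely model-based search can finish in seconds where GPU profiling of each candidate could not. A real system would replace the roofline formulas with measured per-kernel latencies and add the engine-specific flags to the search space.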

9 Citations

3 Influential

6.5 Altmetric

47.5 Score
