2601.00227v1 Jan 01, 2026 cs.AI

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Shanli Xing

Citations: 6

h-index: 1

Yiyan Zhai

Citations: 22

h-index: 2

Alexander Jiang

Citations: 6

h-index: 1

Yixin Dong

Citations: 70

h-index: 3

Yong Wu

Citations: 44

h-index: 4

Zihao Ye

Citations: 19

h-index: 2

Charlie F. Ruan

Citations: 75

h-index: 3

Yingyi Huang

Citations: 22

h-index: 2

Yineng Zhang

Citations: 379

h-index: 5

Aksara Bayyapu

Citations: 6

h-index: 1

Luis Ceze

Citations: 862

h-index: 8

Tianqi Chen

Citations: 688

h-index: 5

Liangsheng Yin

Citations: 1,096

h-index: 5

Abstract

Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.
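The abstract describes a unified schema (kernel definitions, workloads, implementations, evaluations) and a substitution step that picks the best-performing kernel. A minimal sketch of that idea follows. It is an illustration only and not the FlashInfer-Bench API: every class, field, and function name here (`Definition`, `Workload`, `Solution`, `Evaluation`, `best_solution`, `apply_best`) is hypothetical, and the latency values are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


# Hypothetical record types mirroring the four parts named in the abstract.
@dataclass(frozen=True)
class Definition:
    name: str                          # which kernel this is
    reference: Callable[..., object]   # trusted fallback implementation


@dataclass(frozen=True)
class Workload:
    name: str
    definition: str                    # name of the Definition it exercises


@dataclass(frozen=True)
class Solution:
    name: str
    definition: str
    run: Callable[..., object]         # candidate implementation


@dataclass(frozen=True)
class Evaluation:
    solution: str
    definition: str
    workload: str
    correct: bool
    latency_ms: float


def best_solution(definition: str, solutions: List[Solution],
                  evaluations: List[Evaluation]) -> Optional[Solution]:
    """Return the fastest solution among those evaluated as correct."""
    passing = [e for e in evaluations
               if e.definition == definition and e.correct]
    if not passing:
        return None
    winner = min(passing, key=lambda e: e.latency_ms)
    by_name = {s.name: s for s in solutions}
    return by_name.get(winner.solution)


def apply_best(definition: Definition, solutions: List[Solution],
               evaluations: List[Evaluation]) -> Callable[..., object]:
    """Pick the best evaluated kernel, falling back to the reference."""
    chosen = best_solution(definition.name, solutions, evaluations)
    return chosen.run if chosen is not None else definition.reference


# Toy "kernels": elementwise addition over Python lists.
def reference_add(a, b):
    return [x + y for x, y in zip(a, b)]


def fast_add(a, b):
    return list(map(sum, zip(a, b)))


def wrong_add(a, b):
    return list(a)  # deliberately incorrect candidate


ADD = Definition(name="add", reference=reference_add)
LOAD = Workload(name="len2", definition="add")
SOLUTIONS = [
    Solution(name="fast", definition="add", run=fast_add),
    Solution(name="wrong", definition="add", run=wrong_add),
]
# Placeholder evaluation records; latencies are made up for illustration.
EVALS = [
    Evaluation("fast", "add", LOAD.name, correct=True, latency_ms=0.8),
    Evaluation("wrong", "add", LOAD.name, correct=False, latency_ms=0.1),
]

kernel = apply_best(ADD, SOLUTIONS, EVALS)
print(kernel([1, 2], [3, 4]))
```

The point of the sketch is the ordering of the checks: correctness filters candidates first, and only then does latency decide the winner, so a fast but wrong kernel is never substituted. When no candidate passes, the reference implementation is used.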

6 Citations

0 Influential

4 Altmetric

26.0 Score
