2603.29010v1 Mar 30, 2026 cs.LG

Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

S. Hari (citations: 19, h-index: 3)

S. Damani (citations: 143, h-index: 7)

Vignesh Balaji (citations: 400, h-index: 5)

Qijing Huang (citations: 164, h-index: 3)

Christos Kozyrakis (citations: 13, h-index: 3)

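Optimizing GPU kernels with LLM agents is an iterative search over a vast design space. Each candidate has to be generated, compiled, validated, and profiled, so reducing the number of trials saves both runtime and cost. This work makes two observations. First, the abstraction level at which the agent operates matters: if it is too low, the LLM spends reasoning effort on low-impact details; if it is too high, it can miss important optimization choices. Second, agents cannot easily recognize the point of diminishing returns, so they keep searching and waste resources. Two design principles follow: (1) a compact domain-specific language (DSL) that can be learned in context, letting the model reason at a higher level while keeping the important optimization levers, and (2) Speed-of-Light (SOL) guidance, which uses first-principles performance bounds to steer the search and manage its budget. The principles are implemented in $μ$CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. SOL guidance is used to estimate the remaining headroom, guide optimization trials, deprioritize problems already near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with equal iteration budgets, having GPT-5-mini generate DSL code instead of low-level code turns a 0.40x geomean regression into a 1.27x speedup over PyTorch; adding SOL-guided steering raises this to 1.56x. Across model tiers, $μ$CUTLASS with SOL guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19-43% of tokens while retaining at least 95% of the geomean speedup, and the best policy reaches a 1.68x efficiency gain. Finally, SOL analysis helps detect benchmark gaming, where a kernel appears fast but does not perform the intended computation.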

Original Abstract

Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer and budget search. We implement these principles in $μ$CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low-level code to DSL code using GPT-5-mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL-guided steering raises this to 1.56x. Across model tiers, $μ$CUTLASS + SOL-guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19-43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark-gaming cases, where kernels may appear fast while failing to perform the intended computation.
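
The abstract does not spell out how the SOL bound is computed, so the following is only a minimal sketch of what a first-principles bound and the resulting triage might look like. It assumes a roofline-style bound (the larger of compute time and memory-traffic time); the names `Device`, `sol_time`, and `triage`, the peak numbers, and the `near_sol` threshold are all hypothetical and do not come from the paper. Treating a kernel that beats the bound as suspect is one plausible signal for benchmark gaming, not necessarily the authors' criterion.

```python
from dataclasses import dataclass
from math import exp, log


@dataclass(frozen=True)
class Device:
    """Hypothetical peak rates; real values depend on the GPU and data type."""
    peak_flops: float      # FLOP/s
    peak_bandwidth: float  # bytes/s


def sol_time(flops: float, bytes_moved: float, dev: Device) -> float:
    """Roofline-style lower bound on runtime in seconds (assumed SOL model)."""
    return max(flops / dev.peak_flops, bytes_moved / dev.peak_bandwidth)


def triage(measured_s: float, sol_s: float, near_sol: float = 1.25) -> str:
    """Classify a kernel by its headroom, measured time divided by the bound.

    The 1.25 threshold is an illustrative assumption, not a value from the paper.
    """
    ratio = measured_s / sol_s
    if ratio < 1.0:
        # Faster than the physical bound: the kernel may be skipping work.
        return "suspect"
    if ratio <= near_sol:
        # Little headroom left, so further trials are unlikely to pay off.
        return "deprioritize"
    return "optimize"


def geomean(values) -> float:
    """Geometric mean, the aggregate used for the reported speedups."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))


# Example: FP16 GEMM with M = N = K = 4096 on a made-up device.
dev = Device(peak_flops=100e12, peak_bandwidth=1e12)
n = 4096
gemm_flops = 2 * n**3            # one multiply and one add per inner-product term
gemm_bytes = 3 * n * n * 2       # read A and B, write C, 2 bytes per element
bound = sol_time(gemm_flops, gemm_bytes, dev)

for measured in (0.5 * bound, 1.1 * bound, 3.0 * bound):
    print(f"{measured * 1e3:.3f} ms -> {triage(measured, bound)}")
```

With these made-up peaks the GEMM is compute-bound, so the bound is set by `flops / peak_flops`. A budgeting policy could then spend remaining iterations only on problems labeled `optimize`, which is the kind of deprioritization the abstract describes.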
