2604.18616v1 Apr 16, 2026 cs.DC

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

Qiuchun Yu

Christos Kozyrakis

Haohui Mai

Binhang Yuan

Xiaoyang Guo

Xiangyun Ding

Daifeng Li

Chenzhun Guo

Cong Wang

JiaCheng Zhao

Abstract

LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, leaving them unable to diagnose global constraint violations. We present Argus, an agentic framework that addresses this through data-flow invariants: compile-time specifications encoding how data must be choreographed throughout kernel execution. Argus introduces a tile-based, Pythonic DSL exposing hardware instructions and compiler policies while hiding low-level representations. The DSL provides tag functions to propagate symbolic annotations through data and control flow, and tag assertions to enforce relational constraints at use sites. When violations occur, the compiler returns concrete counterexamples identifying the thread, data element, and program point, enabling dense, structured feedback for targeted fixes. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving, with zero runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques. We evaluate Argus on the AMD MI300X GPU across GEMM, flash attention, and MoE kernels accounting for over 90% of GPU time in LLM inference. Generated kernels achieve 99-104% of state-of-the-art hand-optimized assembly throughput and are 2-1543x faster than existing agentic systems. Argus further generalizes to 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems.
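The tag-function and tag-assertion idea can be illustrated with a small Python sketch. This is not Argus's DSL or API: every name below (`check_staging`, `Counterexample`, `transpose`, `identity`) is hypothetical, and the abstract interpretation and SMT solving described in the abstract are replaced here by brute-force enumeration over a tiny 4x4 tile. The sketch only shows the shape of the feedback: data carries a symbolic tag (its source index), a use site asserts which tag each thread must observe, and a violation is reported as a concrete thread, data element, and program point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

TILE = 4  # hypothetical 4x4 tile, one thread per element


@dataclass(frozen=True)
class Counterexample:
    thread: int    # thread that observed the violation
    element: int   # shared-memory slot that thread read
    point: str     # label of the program point where the assertion failed
    expected: int  # tag the invariant required
    actual: int    # tag actually carried by the data (-1 if never written)


def check_staging(
    num_threads: int,
    store_slot: Callable[[int], int],
    load_slot: Callable[[int], int],
    expected_tag: Callable[[int], int],
) -> Optional[Counterexample]:
    """Check a staged copy through shared memory by exhaustive enumeration.

    Thread t stores the global element t into slot store_slot(t). The stored
    value is tagged with its source index t (the "tag function"). After a
    barrier, thread t loads slot load_slot(t); the "tag assertion" requires
    the loaded tag to equal expected_tag(t).
    """
    # Tag propagation: record which source index ends up in each slot.
    shared_tags: Dict[int, int] = {}
    for t in range(num_threads):
        shared_tags[store_slot(t)] = t

    # Tag assertion at the use site, checked for every thread.
    for t in range(num_threads):
        slot = load_slot(t)
        actual = shared_tags.get(slot, -1)
        if actual != expected_tag(t):
            return Counterexample(t, slot, "load_after_barrier",
                                  expected_tag(t), actual)
    return None


def identity(t: int) -> int:
    return t


def transpose(t: int) -> int:
    # Map row-major index (r, c) to the index of (c, r).
    r, c = divmod(t, TILE)
    return c * TILE + r


# A transposing load satisfies the invariant; forgetting the transpose does not.
ok = check_staging(TILE * TILE, identity, transpose, transpose)
bad = check_staging(TILE * TILE, identity, identity, transpose)
print(ok)
print(bad)
```

The failing case returns a structured record rather than a bare pass/fail, which is the kind of dense feedback the abstract contrasts with sparse test results. A real verifier would reason symbolically over layouts instead of enumerating threads, so it would scale to full kernels.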
