2602.19128v1 Feb 22, 2026 cs.AI

K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model

Ion Stoica

Citations: 2,264

h-index: 10

Shiyi Cao

Citations: 2,122

h-index: 14

Ziming Mao

Citations: 152

h-index: 7

Joseph Gonzalez

Citations: 116

h-index: 5

Original Abstract

Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi-step structural transformations, as they lack explicit planning capabilities and frequently discard promising strategies due to inefficient or incorrect intermediate implementations. To address this, we propose Search via Co-Evolving World Model and build K-Search based on this method. By replacing static search heuristics with a co-evolving world model, our framework leverages LLMs' prior domain knowledge to guide the search, actively exploring the optimization space. This approach explicitly decouples high-level algorithmic planning from low-level program instantiation, enabling the system to navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K-Search significantly outperforms state-of-the-art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K-Search achieves state-of-the-art performance on H100, reaching 1030us and surpassing both prior evolution and human-designed solutions.
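The decoupling described in the abstract can be illustrated with a toy sketch. This is not the K-Search algorithm: the `Plan` class, the `search` loop, the scripted `instantiate` stand-in, the failure penalty, and all latency numbers are hypothetical, chosen only to show how keeping a belief per high-level plan (a crude stand-in for a world model) lets a search survive failed implementations of a promising plan instead of discarding it.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """A high-level strategy, tracked separately from any single implementation."""
    name: str
    estimate_us: float  # believed achievable latency in microseconds (lower is better)
    measured_us: list = field(default_factory=list)  # successful measurements only
    failures: int = 0   # incorrect or non-compiling attempts


def search(plans, instantiate, budget, fail_penalty=1.1):
    """Spend `budget` implementation attempts, always on the most promising plan.

    `instantiate(plan)` is a hypothetical stand-in for "generate a kernel for this
    plan and benchmark it"; it returns a latency, or None if the attempt failed.
    """
    for _ in range(budget):
        plan = min(plans, key=lambda p: p.estimate_us)
        latency = instantiate(plan)
        if latency is None:
            # A failed attempt only weakens the belief; the plan is never dropped.
            plan.failures += 1
            plan.estimate_us *= fail_penalty
        else:
            # A successful attempt pulls the belief toward the measurement.
            plan.measured_us.append(latency)
            plan.estimate_us = 0.5 * plan.estimate_us + 0.5 * latency
    scored = [(min(p.measured_us), p.name) for p in plans if p.measured_us]
    return min(scored) if scored else None


def make_scripted(outcomes):
    """Build a deterministic `instantiate` that replays fixed outcomes per plan."""
    iterators = {name: iter(values) for name, values in outcomes.items()}
    return lambda plan: next(iterators[plan.name])


# Assumed initial beliefs and outcomes, purely for illustration.
plans = [Plan("baseline", 100.0), Plan("tiling", 80.0), Plan("tiling+fusion", 60.0)]
outcomes = {
    "baseline": [100.0] * 5,
    "tiling": [75.0] * 5,
    "tiling+fusion": [None, None, 45.0, 42.0, 41.0],  # two failures before it works
}
best = search(plans, make_scripted(outcomes), budget=4)
print(best)
```

In this scripted run the most promising plan fails twice before producing a working kernel. Because each failure only raises its estimate by 10%, it still ranks first and the search keeps trying it. A loop that dropped any candidate with an incorrect implementation would have abandoned it after the first attempt.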
