2602.19762v1 Feb 23, 2026 cs.PL

Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)

M. Baskaran

Citations: 2,473

h-index: 22

Abhikrant Sharma

Citations: 1

h-index: 1

Abhilash Bhandari

Citations: 4

h-index: 1

Ankita Aggarwal

Citations: 31

h-index: 2

A. Rangasamy

Citations: 119

h-index: 5

Dibyendu Das

Citations: 101

h-index: 4

Fateme S. Hosseini

Citations: 102

h-index: 5

I. Brumar

Citations: 42

h-index: 4

Jyotsna Verma

Citations: 20

h-index: 3

Krishnaprasad Bindumadhavan

Citations: 1

h-index: 1

Mitesh Kothari

Citations: 2

h-index: 1

Ravishankar Kolachana

Citations: 1

h-index: 1

R. Lethin

Citations: 1,567

h-index: 20

Samarth Narang

Citations: 1

h-index: 1

S. M. Ladwa

Citations: 77

h-index: 3

Snigdha Dalvi

Citations: 4

h-index: 1

Tasmiat Rahman

Citations: 9

h-index: 1

Venkat Rasagna Reddy Komatireddy

Citations: 1

h-index: 1

Vivek Vasudevbhai Pandya

Citations: 1

h-index: 1

Xiyu Shi

Citations: 52

h-index: 4

Zachary Zipper

Citations: 1

h-index: 1

M. Gupta

Citations: 126

h-index: 4

Shalini Jain

Citations: 115

h-index: 4

J. Absar

Citations: 342

h-index: 8

F. Slama

Citations: 17

h-index: 2

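This paper introduces Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built on the MLIR framework, the compiler applies a structured sequence of passes that exploit the NPU's architectural features to accelerate AI workloads. By automating compilation from kernel to binary, it enables faster deployment of new Triton kernels (hand-written or subgraphs from PyTorch 2.0) on the target device. It takes Triton kernels as input and generates mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), which reduces the bandwidth bottlenecks inherent in existing library-based approaches. The project complements our commercial toolchains by providing an open-source, MLIR-based compilation stack that gives developers a more flexible path to advancing AI compilation capabilities. Hexagon-MLIR is a work in progress, and further optimizations and features are being added continuously.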

Original Abstract

In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built using the MLIR framework, our compiler applies a structured sequence of passes to exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels (hand-written or subgraphs from PyTorch 2.0) for our target by providing automated compilation from kernel to binary. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by providing developers with an open-source MLIR-based compilation stack that gives them a path to advance AI compilation capabilities through a more flexible approach. Hexagon-MLIR is a work in progress, and we are continuing to add many more optimizations and capabilities in this effort.

1 Citations

0 Influential

11 Altmetric

56.0 Score
