2601.20706v1 Jan 28, 2026 cs.AR

Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

Binglei Lou (citations: 90, h-index: 6)
Haoran Wu (citations: 14, h-index: 2)
Jiayi Nie (citations: 13, h-index: 2)
Can Xiao (citations: 22, h-index: 3)
Xuan Guo (citations: 10, h-index: 1)
R. Antonova (citations: 14, h-index: 2)
Robert Mullins (citations: 13, h-index: 2)
Aaron Zhao (citations: 18, h-index: 2)
Yao Lai (citations: 402, h-index: 9)

University of Cambridge, The University of Hong Kong, Tsinghua University

Abstract

Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase behaves fundamentally differently from GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes of vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU at an equivalent process technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
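To make the three sampling costs named in the abstract concrete (vocabulary-wide logits, reduction-based token selection, iterative masked updates), here is a minimal pure-Python sketch of one denoising step. It is not the paper's implementation. It assumes a confidence-based unmasking rule, which is one common choice in masked diffusion language models; the abstract does not state the exact rule. `MASK_ID`, `softmax`, `sampling_step`, and `demo` are hypothetical names introduced for illustration.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel marking a still-masked position


def softmax(row):
    """Numerically stable softmax: one max reduction, then one sum reduction."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sampling_step(logits, tokens, num_to_unmask):
    """One illustrative denoising step (assumed confidence-based rule).

    logits: seq_len rows, each holding vocab_size floats.
    tokens: seq_len ints, with MASK_ID at undecided positions.
    Returns a new token list with up to num_to_unmask positions filled in.
    """
    scored = []
    for pos, row in enumerate(logits):
        if tokens[pos] != MASK_ID:
            continue  # already committed in an earlier step
        probs = softmax(row)  # reads the full vocabulary-wide row
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax reduction
        scored.append((probs[best], pos, best))

    # Select the most confident masked positions (a top-k over the sequence).
    scored.sort(key=lambda item: -item[0])

    # Masked update: overwrite only the selected positions.
    updated = list(tokens)
    for _, pos, token in scored[:num_to_unmask]:
        updated[pos] = token
    return updated


def demo(seq_len=8, vocab_size=32, per_step=2, seed=0):
    """Run steps until nothing is masked; random logits stand in for a model."""
    rng = random.Random(seed)
    tokens = [MASK_ID] * seq_len
    steps = 0
    while MASK_ID in tokens:
        logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
                  for _ in range(seq_len)]
        tokens = sampling_step(logits, tokens, per_step)
        steps += 1
    return tokens, steps
```

Even in this toy form, every step touches `seq_len * vocab_size` logits while doing only a few arithmetic operations per element. That is consistent with the abstract's point that sampling is dominated by memory traffic and reductions rather than matrix multiplication.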

Citations: 1
Influential citations: 0
Altmetric: 4.5
Score: 23.5
