2604.15464v1 Apr 16, 2026 cs.PF

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Blake A. Hechtman

Citations: 5,867

h-index: 16

Jevin Jiang

Citations: 3

h-index: 1

Yarong Mu

Citations: 0

h-index: 0

Ying Chen

Citations: 11

h-index: 2

Feng Zhang

Citations: 2

h-index: 1

Abstract

Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures, particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance and flexible attention kernel for TPUs, implemented using Pallas and Mosaic. RPA addresses these challenges through three key techniques: (1) fine-grained tiling to enable efficient dynamic slicing over ragged memory, (2) a custom software pipeline that fuses KV cache updates with attention computation, and (3) a distribution-aware compilation strategy that generates specialized kernels for decode, prefill, and mixed workloads. Evaluated on Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. Integrated as the primary TPU backend in vLLM and SGLang, RPA provides a production-grade foundation for efficient TPU inference and offers practical insights into kernel design.
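To make the terms "ragged" and "paged" concrete, here is a minimal pure-Python reference of what such a kernel computes. It is a sketch under stated assumptions, not the paper's Pallas/Mosaic implementation: it uses a single attention head, it performs no tiling, pipelining, or fused cache update, and the input layout (queries flattened across sequences with cumulative offsets, plus a per-sequence page table) is assumed here for illustration because the abstract does not specify it. All names, including ragged_paged_attention_ref, cu_q_lens, and page_tables, are hypothetical.

```python
import math


def ragged_paged_attention_ref(q, k_pages, v_pages, page_tables,
                               cu_q_lens, kv_lens, page_size):
    """Single-head causal attention over a ragged batch with a paged KV cache.

    q           : flat list of query vectors for all sequences (assumed layout)
    k_pages     : k_pages[page][slot] is one key vector
    v_pages     : v_pages[page][slot] is one value vector
    page_tables : page_tables[seq] lists the physical pages of that sequence
    cu_q_lens   : cumulative query lengths; sequence i owns q[cu[i]:cu[i+1]]
    kv_lens     : tokens in the cache per sequence, new tokens included
    """
    out = []
    for seq, kv_len in enumerate(kv_lens):
        q_start, q_end = cu_q_lens[seq], cu_q_lens[seq + 1]
        q_len = q_end - q_start  # ragged: differs from sequence to sequence
        for j in range(q_len):
            query = q[q_start + j]
            scale = 1.0 / math.sqrt(len(query))
            # The queries are the last q_len tokens of the sequence, so
            # query j may see cache positions 0 .. kv_len - q_len + j.
            last_visible = kv_len - q_len + j
            scores, values = [], []
            for pos in range(last_visible + 1):
                # Paged lookup: logical position -> (physical page, slot).
                page = page_tables[seq][pos // page_size]
                slot = pos % page_size
                key = k_pages[page][slot]
                scores.append(scale * sum(a * b for a, b in zip(query, key)))
                values.append(v_pages[page][slot])
            # Numerically stable softmax over the visible positions.
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            out.append([
                sum(w * v[d] for w, v in zip(weights, values)) / total
                for d in range(len(values[0]))
            ])
    return out


# Mixed batch: sequence 0 is a decode step (1 new token, 5 cached tokens),
# sequence 1 is a prefill (3 new tokens, 3 cached tokens).
page_size, head_dim, num_pages = 2, 2, 5
k_pages = [[[0.0] * head_dim for _ in range(page_size)] for _ in range(num_pages)]
v_pages = [[[0.0] * head_dim for _ in range(page_size)] for _ in range(num_pages)]
page_tables = [[4, 0, 2], [3, 1]]  # physical pages need not be contiguous
kv_lens = [5, 3]
cu_q_lens = [0, 1, 4]

# Fill the cache with made-up keys and values through the page table.
for seq, kv_len in enumerate(kv_lens):
    for pos in range(kv_len):
        page = page_tables[seq][pos // page_size]
        slot = pos % page_size
        k_pages[page][slot] = [0.1 * (pos + 1), 0.2 * (seq + 1)]
        v_pages[page][slot] = [float(pos), float(seq)]

q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
out = ragged_paged_attention_ref(q, k_pages, v_pages, page_tables,
                                 cu_q_lens, kv_lens, page_size)
```

In this toy batch the first prefill token of sequence 1 can see only cache position 0, so its output row equals the value stored there. A production kernel would replace the per-token Python loops with tiled loads of whole pages, which is the part the three techniques in the abstract address.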

0 Citations

0 Influential

8 Altmetric

40.0 Score
