2601.15710v1 Jan 22, 2026 cs.AR

FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design

Jason Cong

Citations: 189

h-index: 8

Yizhou Sun

Citations: 234

h-index: 8

Jiahao Zhang

Citations: 5

h-index: 2

Zifan He

Citations: 74

h-index: 6

Nicholas Fraser

Citations: 6

h-index: 2

M. Blott

Citations: 54

h-index: 4

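This paper introduces FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for the prefill and decode stages. It also provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we built a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes (1) a stage-customized accelerator with hardware-efficient quantization that reaches 12.68 WikiText-2 PPL, surpassing the SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the 16nm AMD U280 FPGA, the accelerator achieves a 1.29× end-to-end speedup, 1.64× higher decode throughput, and 3.14× better energy efficiency than a 7nm NVIDIA A100 GPU running BF16 inference. Projected results on the 7nm V80 FPGA reach 4.71×, 6.55×, and 4.13×, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23× and extends the context window by 64×, delivering 1.10×/4.86× lower end-to-end latency and 5.21×/6.27× higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.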

Original Abstract

We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71$\times$, 6.55$\times$, and 4.13$\times$, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23$\times$ and extends the context window by 64$\times$, delivering 1.10$\times$/4.86$\times$ lower end-to-end latency and 5.21$\times$/6.27$\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.
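
The 12.68 WikiText-2 PPL figure above is a perplexity score, where lower is better. As a reminder of how that metric is conventionally defined (the exponential of the mean per-token negative log-likelihood), here is a minimal sketch. The function name `perplexity` and the toy inputs are hypothetical and are not taken from FlexLLM or from the paper's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a token sequence.

    token_log_probs: natural-log probabilities the model assigned to each
    ground-truth token (hypothetical input format for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every token probability 0.25 is as
# uncertain as a uniform choice among 4 options, so PPL is about 4.
toy_log_probs = [math.log(0.25)] * 8
print(round(perplexity(toy_log_probs), 6))
```

Read this way, a perplexity of 12.68 means the quantized model is, on average, about as uncertain as a uniform choice among roughly 12.68 tokens at each position of the test set.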

2 Citations

0 Influential

4 Altmetric

22.0 Score

