2603.29002v1 Mar 30, 2026 cs.DC

Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

Jason Cong

Citations: 189

h-index: 8

Yizhou Sun

Citations: 234

h-index: 8

Zifan He

Citations: 74

h-index: 6

Rui Ma

Citations: 7

h-index: 1

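Modern large language models (LLMs) increasingly rely on efficient long-context processing and generation mechanisms, such as sparse attention, retrieval-augmented generation (RAG), and compressed context memory, to support complex reasoning. This work shows that these optimizations can be unified into a memory processing pipeline with four stages: memory preparation, relevance computation, retrieval, and application to inference. Systematic profiling reveals a memory processing overhead of 22% to 97% in LLM inference, with highly heterogeneous computational characteristics. Based on this analysis, the authors argue that heterogeneous systems are well suited to accelerating memory processing and, ultimately, overall inference performance. They implement this approach on a GPU-FPGA system, offloading sparse, irregular, memory-bound operations to the FPGA while keeping compute-intensive operations on the GPU. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, the proposed system is $1.04\times$ to $2.2\times$ faster and consumes $1.11\times$ to $4.7\times$ less energy than the GPU-based system across a range of LLM inference optimizations (similar results were observed on an NVIDIA A100). These results show that heterogeneous systems are a practical approach to efficient LLM memory processing and provide guidance for future heterogeneous hardware design.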

Original Abstract

Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is $1.04\sim2.2\times$ faster and requires $1.11\sim4.7\times$ less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
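As a rough illustration of the four-step pipeline named in the abstract, the sketch below walks a toy top-k sparse attention lookup through Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. It is not the authors' implementation: the function names, the dot-product relevancy score, the `top_k` parameter, and the pure-Python data layout are all assumptions made for illustration, and the sketch does not model which steps would run on the GPU or the FPGA.

```python
import math


def prepare_memory(keys, values):
    """Step 1 (Prepare Memory): pair each key vector with its value vector."""
    return list(zip(keys, values))


def compute_relevancy(query, memory):
    """Step 2 (Compute Relevancy): dot-product score of the query against every key."""
    return [sum(q * k for q, k in zip(query, key)) for key, _ in memory]


def retrieve(scores, top_k):
    """Step 3 (Retrieval): indices of the top_k highest-scoring entries, in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def apply_to_inference(query, memory, indices):
    """Step 4 (Apply to Inference): softmax attention restricted to the retrieved entries."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, memory[i][0])) for i in indices]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(memory[indices[0]][1])
    return [
        sum((w / total) * memory[i][1][d] for w, i in zip(weights, indices))
        for d in range(dim)
    ]


def memory_pipeline(query, keys, values, top_k=2):
    """Run the four steps in order; returns the retrieved indices and the attended output."""
    memory = prepare_memory(keys, values)
    scores = compute_relevancy(query, memory)
    indices = retrieve(scores, top_k)
    return indices, apply_to_inference(query, memory, indices)


if __name__ == "__main__":
    # Hypothetical toy data: four memory entries with 2-dimensional keys and values.
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
    values = [[1.0, 1.0], [5.0, 5.0], [3.0, 3.0], [9.0, 9.0]]
    print(memory_pipeline([1.0, 0.0], keys, values, top_k=2))
```

In this toy version, steps 2 and 3 scan and rank every memory entry while step 4 touches only the retrieved subset. That contrast is a simple instance of the heterogeneity in computational characteristics the abstract refers to, although the real workloads profiled in the paper are far larger and more varied.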
