2602.00397v1 Jan 30, 2026 cs.LG

Fast Forward: Accelerating LLM Prefill with Predictive FFN Sparsity

Junyoung Park

Mingu Lee

Aayush Gautam

Mukul Gagrani

Chris Lott

Narasimha Reddy

Abstract

The prefill stage of large language model (LLM) inference is a key computational bottleneck for long-context workloads. At short-to-moderate context lengths (1K-16K tokens), Feed-Forward Networks (FFNs) dominate this cost, accounting for most of the total FLOPs. Existing FFN sparsification methods, designed for autoregressive decoding, fail to exploit the prefill stage's parallelism and often degrade accuracy. To address this, we introduce FastForward, a predictive sparsity framework that accelerates LLM prefill through block-wise, context-aware FFN sparsity. FastForward combines (1) a lightweight expert predictor to select high-importance neurons per block, (2) an error compensation network to correct sparsity-induced errors, and (3) a layer-wise sparsity scheduler to allocate compute based on token-mixing importance. Across LLaMA and Qwen models up to 8B parameters, FastForward delivers up to 1.45× compute-bound speedup at 50% FFN sparsity with <6% accuracy loss compared to the dense baseline on LongBench, substantially reducing Time-to-First-Token (TTFT) for efficient, long-context LLM inference on constrained hardware.
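
The claim that FFNs account for most prefill FLOPs at 1K-16K tokens can be sanity-checked with a back-of-the-envelope count. The sketch below is not taken from the paper. It assumes layer dimensions matching the public LLaMA-3-8B configuration (`d_model=4096`, `d_ff=14336`, grouped-query attention with a 1024-wide KV projection), a SwiGLU FFN with three projections, causal attention, and 2 FLOPs per multiply-add. It ignores softmax, normalization, embeddings, and the LM head. The names `LayerDims`, `ffn_fraction`, and `ideal_speedup` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerDims:
    # Assumed values (LLaMA-3-8B-like); not stated in the abstract.
    d_model: int = 4096   # hidden size
    d_ff: int = 14336     # FFN intermediate size
    d_kv: int = 1024      # K/V projection width under grouped-query attention


def ffn_flops(n: int, dims: LayerDims) -> int:
    """Per-layer FFN FLOPs for n tokens: gate, up, and down projections."""
    return 2 * n * 3 * dims.d_model * dims.d_ff


def attention_flops(n: int, dims: LayerDims) -> int:
    """Per-layer attention FLOPs for n tokens with causal masking."""
    # Q and O projections are d_model x d_model; K and V are d_model x d_kv.
    proj = 2 * n * (2 * dims.d_model**2 + 2 * dims.d_model * dims.d_kv)
    # QK^T plus AV cost 4*n*n*d_model unmasked; the causal mask halves that.
    mix = 2 * n * n * dims.d_model
    return proj + mix


def ffn_fraction(n: int, dims: LayerDims = LayerDims()) -> float:
    """Share of per-layer FLOPs spent in the FFN at context length n."""
    f = ffn_flops(n, dims)
    return f / (f + attention_flops(n, dims))


def ideal_speedup(n: int, sparsity: float, dims: LayerDims = LayerDims()) -> float:
    """Amdahl-style upper bound when a `sparsity` share of FFN FLOPs is skipped.

    Ignores any predictor or compensation overhead, so it is an idealized
    bound under the assumptions above, not a reproduction of reported results.
    """
    return 1.0 / (1.0 - sparsity * ffn_fraction(n, dims))


if __name__ == "__main__":
    for n in (1024, 4096, 16384, 32768):
        print(n, round(ffn_fraction(n), 3), round(ideal_speedup(n, 0.5), 3))
```

Under these assumptions the FFN share falls as the context grows, because the attention term is quadratic in `n`, and it drops to half of the per-layer FLOPs only around 32K tokens. That is consistent with the abstract's focus on the 1K-16K range, though the exact crossover depends on the model configuration.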
