2603.06199v1 Mar 06, 2026 cs.CL

FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Qihang Fan (Citations: 428, h-index: 8)
Huaibo Huang (Citations: 3,298, h-index: 29)
Zhiying Wu (Citations: 95, h-index: 2)
Juqiu Wang (Citations: 9, h-index: 2)
Bingning Wang (Citations: 33, h-index: 2)
Ran He (Citations: 404, h-index: 8)

Original Abstract

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
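The abstract says that dynamic thresholding avoids sorting or accumulating attention scores, but it does not state the exact rule. The sketch below is one plausible reading under stated assumptions, not the authors' implementation: each query block keeps the key blocks whose block-level score is at least a fraction `alpha` of that row's maximum. The names `select_blocks_by_threshold`, `sparsity`, `alpha`, and the toy score matrix are all hypothetical. The point it illustrates is that a max-relative cutoff needs only one max pass and one comparison pass per row, whereas top-k or cumulative-mass selection needs a sort or a running sum.

```python
from typing import List


def select_blocks_by_threshold(block_scores: List[List[float]],
                               alpha: float) -> List[List[int]]:
    """Return, for each query block, the indices of the key blocks to keep.

    ASSUMPTION (not taken from the paper): a key block is kept when its
    score is >= alpha * (maximum score in the same causal row).
    No sorting and no cumulative sum are required.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    kept = []
    for i, row in enumerate(block_scores):
        causal = row[: i + 1]              # causal mask: key block j <= query block i
        cutoff = alpha * max(causal)       # single pass to find the row maximum
        kept.append([j for j, s in enumerate(causal) if s >= cutoff])
    return kept


def sparsity(kept: List[List[int]], n_blocks: int) -> float:
    """Fraction of causal (lower-triangular) blocks that are skipped."""
    total = n_blocks * (n_blocks + 1) // 2
    used = sum(len(row) for row in kept)
    return 1.0 - used / total


# Toy, made-up block-level scores for 4 query blocks x 4 key blocks.
# Column 0 is strong in every row (a "vertical" pattern) and the diagonal
# is strong as well (a "slash" pattern); the rest is a low-valued tail.
toy_scores = [
    [1.00, 0.00, 0.00, 0.00],
    [0.60, 0.40, 0.00, 0.00],
    [0.50, 0.02, 0.48, 0.00],
    [0.45, 0.03, 0.02, 0.50],
]

kept_blocks = select_blocks_by_threshold(toy_scores, alpha=0.5)
print(kept_blocks)  # [[0], [0, 1], [0, 2], [0, 3]]
```

With these toy values, the low-scoring tail blocks are dropped while the first column and the diagonal survive, so 7 of the 10 causal blocks are computed. How the block-level scores are obtained, and how the real threshold is chosen, are left to the paper.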
