2605.01910v1 May 03, 2026 cs.LG

Stochastic Sparse Attention for Memory-Bound Inference

Can Yaras (Citations: 202, h-index: 7)

Samet Oymak (Citations: 5,792, h-index: 40)

Kyle Lee (Citations: 22, h-index: 2)

Corentin Delacour (Citations: 336, h-index: 10)

Kevin Callahan-Coray (Citations: 3, h-index: 1)

Kyle Jiang (Citations: 97, h-index: 3)

T. Srimani (Citations: 18, h-index: 2)

K. Camsari (Citations: 36, h-index: 3)

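Autoregressive decoding is bandwidth-limited at long contexts because generating each token requires reading all $n_k$ key and value vectors from the KV cache. This paper presents Stochastic Additive No-mulT Attention (SANTA), which sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregating only the value rows at the sampled indices. This gives an unbiased estimate of the post-softmax value aggregation and replaces the value-stage multiply-accumulate operations with gather-and-add. The authors use stratified sampling to design variance-reduced, GPU-friendly SANTA variants, reporting a $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. They also propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique that sparsifies the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. The kernels are available on GitHub: https://github.com/OPUSLab/SANTA.git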
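
The value-stage idea can be illustrated with a small pure-Python sketch. It is not the authors' kernel: the function names (`santa_iid`, `santa_stratified`) are hypothetical, and the stratification shown here (one uniform draw per equal-width slice of the softmax CDF) is an assumption, since the abstract does not specify the exact scheme. The sketch only shows why averaging gathered value rows estimates the softmax-weighted sum without multiplying by the attention weights.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def exact_attention(p, values):
    """Dense reference: sum_i p[i] * values[i], reading every value row."""
    d = len(values[0])
    return [sum(p[i] * values[i][j] for i in range(len(p))) for j in range(d)]

def santa_iid(p, values, num_samples, rng):
    """Draw S indices i ~ p independently and average the gathered rows."""
    d = len(values[0])
    acc = [0.0] * d
    for i in rng.choices(range(len(p)), weights=p, k=num_samples):
        row = values[i]
        for j in range(d):
            acc[j] += row[j]  # gather-and-add: no multiply by p[i]
    return [a / num_samples for a in acc]

def santa_stratified(p, values, num_samples, rng):
    """Assumed stratified variant: one draw per CDF slice [s/S, (s+1)/S)."""
    d = len(values[0])
    acc = [0.0] * d
    i, cdf = 0, p[0]
    for s in range(num_samples):
        u = (s + rng.random()) / num_samples  # uniform inside stratum s
        # u increases with s, so a single forward walk inverts the CDF
        while u >= cdf and i < len(p) - 1:
            i += 1
            cdf += p[i]
        row = values[i]
        for j in range(d):
            acc[j] += row[j]
    return [a / num_samples for a in acc]

# Small usage example with made-up sizes (n_k = 16 keys, d = 4, S = 4 samples).
rng = random.Random(0)
p = softmax([rng.uniform(-2.0, 2.0) for _ in range(16)])
V = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(16)]
dense = exact_attention(p, V)
sparse = santa_stratified(p, V, 4, rng)  # touches at most 4 value rows
```

Both estimators have expectation equal to the dense result, because each sampled index lands on row $i$ with average probability $p_i$. The stratified version spreads the draws across the CDF, which is the usual reason stratification lowers variance relative to independent draws.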

Original Abstract

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git
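
The score-stage idea (stochastic ternary queries) can be sketched in the same spirit. The abstract does not give the construction, so the rounding rule below is an assumption: each query entry is kept as its sign with probability proportional to its magnitude and dropped otherwise, which makes the rescaled ternary query unbiased for the original one. The names `ternary_query` and `sparse_scores` are hypothetical and do not come from the SANTA repository.

```python
import random

def ternary_query(q, rng):
    """Assumed stochastic rounding of q to {-1, 0, +1} with a shared scale.

    Entry j becomes sign(q[j]) with probability |q[j]| / max|q|, else 0,
    so that E[scale * t[j]] == q[j].
    """
    scale = max(abs(x) for x in q)
    if scale == 0.0:
        return 0.0, [0] * len(q)
    t = []
    for x in q:
        keep = rng.random() < abs(x) / scale  # Bernoulli(|x| / scale)
        t.append((1 if x > 0 else -1) if keep else 0)
    return scale, t

def sparse_scores(scale, t, keys):
    """Estimate q . k for every key using only the kept feature columns."""
    active = [(j, s) for j, s in enumerate(t) if s != 0]
    scores = []
    for k in keys:
        acc = 0.0
        for j, s in active:
            acc += k[j] if s > 0 else -k[j]  # add or subtract, no multiply
        scores.append(scale * acc)  # one rescale per key
    return scores

# Small usage example with made-up data: 3 keys with 4 features each.
rng = random.Random(0)
q = [0.9, -0.1, 0.0, 0.4]
K = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [-2.0, 0.0, 1.0, 1.0]]
scale, t = ternary_query(q, rng)
approx = sparse_scores(scale, t, K)
```

Under this assumed rule, key features whose ternary entry is zero are never read, which is where the reduction in key-feature access would come from. The estimated scores are unbiased for $qK^\mathsf{T}$, but a softmax applied to them is not unbiased in general because softmax is nonlinear.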
