2603.06274v1 Mar 06, 2026 cs.LG

Stem: Rethinking Causal Information Flow in Sparse Attention

Yifu Sun (Citations: 452, h-index: 6)

Guanghua Yu (Citations: 40, h-index: 4)

Jianchen Zhu (Citations: 3, h-index: 1)

Lin Niu (Citations: 379, h-index: 4)

Xin Luo (Citations: 4, h-index: 1)

Li Xie (Citations: 1, h-index: 1)

S. K. Zhou (Citations: 211, h-index: 5)

Original Abstract

The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling Large Language Models (LLMs) to long contexts, particularly during the pre-filling phase. In this paper, we rethink the causal attention mechanism from the perspective of information flow. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a uniform top-k selection across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose Stem, a novel, plug-and-play sparsity module aligned with information flow. First, Stem employs the Token Position-Decay strategy, applying position-dependent top-k within each layer to retain initial tokens for recursive dependencies. Second, to preserve information-rich tokens, Stem utilizes the Output-Aware Metric. It prioritizes high-impact tokens based on approximate output magnitude. Extensive evaluations demonstrate that Stem achieves superior accuracy with reduced computation and pre-filling latency.
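The abstract names two components but gives no formulas, so the following is only a minimal NumPy sketch of how they might fit together, not the authors' implementation. Everything specific here is an assumption: the function names, the linear keep-ratio schedule in `position_decay_budget`, and the use of attention probability times value norm as a stand-in for "approximate output magnitude". For readability the sketch computes dense attention probabilities first, so it illustrates the selection logic only and saves no computation.

```python
import numpy as np


def position_decay_budget(n, keep_start=1.0, keep_end=0.25, k_min=4):
    """Hypothetical position-dependent top-k budget for each query position.

    Assumption: the fraction of causal context that is kept decays linearly
    from keep_start (first query) to keep_end (last query).
    """
    ratios = np.linspace(keep_start, keep_end, n)
    causal_len = np.arange(1, n + 1)          # keys visible to query i
    k = np.ceil(ratios * causal_len).astype(int)
    # Keep at least k_min keys, but never more than the causal context.
    return np.minimum(np.maximum(k, k_min), causal_len)


def causal_probs(q, k):
    """Dense causal softmax attention probabilities, shape (n, n)."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)


def sparse_causal_attention(q, k, v, budgets):
    """Keep, per query, the budgets[i] keys with the largest assumed score."""
    n, _ = q.shape
    probs = causal_probs(q, k)
    # Assumed output-aware score: attention weight times value-vector norm.
    score = probs * np.linalg.norm(v, axis=-1)[None, :]
    out = np.zeros_like(v)
    for i in range(n):
        keep = np.argsort(-score[i, : i + 1])[: budgets[i]]
        w = probs[i, keep]
        out[i] = (w / w.sum()) @ v[keep]      # renormalize over kept keys
    return out
```

Calling `sparse_causal_attention(q, k, v, position_decay_budget(len(q)))` applies a different top-k at each query position, in contrast to the uniform top-k the abstract attributes to existing sparse methods. Passing a budget equal to the full causal length recovers dense causal attention, which makes a convenient sanity check.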
