2604.24432v1 Apr 27, 2026 cs.CL

Kwai Summary Attention Technical Report

Ruiming Tang

Han Li

Ziming Li

Chenglong Chu

Guowang Zhang

Hong Cheng

Jian Liang

Kun Gai

Ling Zhou

Lu Ren

Xinchen Luo

Yinwu Su

Boyang Ding

Dunju Zang

Jiao Ou

Jiaxin Deng

Ji-jin Shi

Junmin Chen

Lejian Ren

Minxuan Lv

Qianqian Wang

Qigen Hu

Shiyao Wang

Si-Zhu Mao

Tao Wang

Zhixin Ling

Zixing Zhang

Guorui Zhou

Hao Peng

Jiangxia Cao

Qi Zhang

Ruitao Wang

Zhiyuan Liang

Ziqi Wang

Chengru Song

Hui Wang

Jinghao Zhang

Xing Wang

Abstract

Long-context ability has become one of the most important directions for next-generation Large Language Models, particularly for semantic understanding and reasoning, agentic code intelligence, and recommendation systems. However, standard softmax attention has quadratic time complexity with respect to sequence length. As sequences grow, this incurs substantial overhead in long-context settings, and the training and inference costs for extremely long sequences rise rapidly. Existing solutions mitigate the issue along two technical routes: (i) reducing the per-layer KV cache, for example through head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache still depends linearly on sequence length at a 1:1 ratio; (ii) interleaving KV-cache-friendly architectures, such as local attention (SWA) or linear kernels (GDN), which often involves a trade-off between KV cache size and long-context modeling effectiveness. Beyond these two routes, we argue that an intermediate path remains underexplored: maintaining a linear relationship between the KV cache and sequence length while performing semantic-level compression at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache"; instead, it accepts a moderate memory cost in exchange for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical context into learnable summary tokens.
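To make the three scaling regimes in the abstract concrete, the following is a minimal arithmetic sketch, not the KSA implementation. It counts how many cached positions each regime keeps for a sequence of length n: full attention keeps n (the 1:1 case that GQA and MLA shrink per entry but not in count), a sliding window keeps at most a fixed number, and the $O(n/k)$ path keeps roughly n/k summary positions. The function names, the default window size, and the model dimensions are hypothetical; the abstract does not say whether KSA also keeps recent uncompressed tokens, so that is left out here.

```python
import math


def kv_cache_entries(n: int, scheme: str, k: int = 8, window: int = 4096) -> int:
    """Number of cached positions for a sequence of n tokens (illustrative only)."""
    if scheme == "full":
        # Standard attention: one cached position per token (1:1 with n).
        return n
    if scheme == "sliding_window":
        # Local attention: the cache is capped at the window size.
        return min(n, window)
    if scheme == "summary":
        # Assumed O(n/k) regime: one summary position per k tokens.
        return math.ceil(n / k)
    raise ValueError(f"unknown scheme: {scheme}")


def kv_cache_bytes(entries: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys plus values (factor 2) across all layers and KV heads."""
    return 2 * entries * num_layers * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    n = 131072  # hypothetical 128K-token context
    for scheme in ("full", "sliding_window", "summary"):
        entries = kv_cache_entries(n, scheme)
        # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
        size_gib = kv_cache_bytes(entries, 32, 8, 128) / 2**30
        print(f"{scheme:>15}: {entries:>7} positions, {size_gib:.2f} GiB")
```

Under these assumed dimensions the summary regime still grows linearly with n, but with a slope k times smaller than full attention, which is the trade-off the abstract describes between a minimal cache and retaining references to the whole history.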
