2602.11761v1 Feb 12, 2026 cs.CL

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

Siyuan Liu (Citations: 25; h-index: 3)
Yudong Wang (Citations: 93; h-index: 4)
Hengyu Zhao (Citations: 88; h-index: 4)
Hongya Lyu (Citations: 71; h-index: 3)
Min An (Citations: 21; h-index: 3)
Yingfa Chen (Tsinghua University; Citations: 563; h-index: 9)
Yewei Fang (Citations: 816; h-index: 4)
Jiayi Li (Citations: 75; h-index: 5)
Xin Li (Citations: 15; h-index: 3)
Yaohui Li (Citations: 5; h-index: 2)
Yishan Li (Citations: 9; h-index: 2)
Yuxuan Li (Citations: 3; h-index: 1)
Biyuan Lin (Citations: 87; h-index: 3)
He Liu (Citations: 255; h-index: 6)
Yinxu Pan (Citations: 74; h-index: 3)
Shixin Ren (Citations: 6; h-index: 2)
Xingyu Shen (Citations: 16; h-index: 3)
Z. Bob Su (Citations: 9; h-index: 2)
Hao Sun (Citations: 37; h-index: 3)
Yan-Ting Sun (Citations: 7; h-index: 2)
Z. Thai (Citations: 2,245; h-index: 6)
Xin-Yu Tian (Citations: 5; h-index: 2)
Rui Wang (Citations: 27; h-index: 3)
Xiaorong Wang (Citations: 50; h-index: 3)
Bo Wu (Citations: 5; h-index: 1)
Xiaoyue Xu (Citations: 24; h-index: 2)
Dongmei Xu (Citations: 15; h-index: 2)
Shuaikang Xue (Citations: 3; h-index: 1)
Jiawei Yang (Citations: 18; h-index: 2)
Bowen Zhang (Citations: 5; h-index: 2)
Jinqian Zhang (Citations: 183; h-index: 5)
Letian Zhang (Citations: 474; h-index: 12)
Shengnan Zhang (Citations: 31; h-index: 3)
Xinyu Zhang (Citations: 7; h-index: 2)
Zhuo Zhang (Citations: 15; h-index: 3)
Jiachen Zhao (Citations: 33; h-index: 2)
Jie Zhou (Citations: 146; h-index: 3)
Shuo Wang (Citations: 10; h-index: 2)
Xuelin Han (Citations: 5; h-index: 2)
Zhiyuan Liu (Citations: 6; h-index: 2)
Maosong Sun (Citations: 243; h-index: 7)
Chaojun Xiao (Citations: 3,593; h-index: 24)
Zihan Zhou (Citations: 89; h-index: 5)
Chuang Liu (Citations: 146; h-index: 5)

Abstract

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
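
The memory argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not the paper's accounting: the layer count, head counts, head dimension, and 16-bit cache precision below are hypothetical values chosen for illustration. It reads the 1:3 ratio as one sparse-attention layer per three linear-attention layers, and it assumes that sparse layers still store a full key/value cache while each linear-attention layer keeps only a fixed-size state of one `head_dim x head_dim` matrix per head. Weights and activations are ignored.

```python
# Rough cache-size comparison: full attention vs. a 1:3 sparse/linear hybrid.
# All model dimensions here are hypothetical, not taken from the paper.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for n_layers softmax-attention layers."""
    # Factor 2 accounts for storing both K and V per token, per layer, per KV head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def linear_state_bytes(n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Bytes for a fixed-size recurrent state (head_dim x head_dim per head)."""
    # Independent of sequence length: this is the source of the savings.
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem


def hybrid_cache_bytes(seq_len, n_sparse_layers, n_linear_layers,
                       n_kv_heads, n_heads, head_dim, bytes_per_elem=2):
    """Cache size when only the sparse layers keep a per-token KV cache (assumption)."""
    sparse = kv_cache_bytes(seq_len, n_sparse_layers, n_kv_heads, head_dim, bytes_per_elem)
    linear = linear_state_bytes(n_linear_layers, n_heads, head_dim, bytes_per_elem)
    return sparse + linear


# Hypothetical configuration: 32 layers, 32 query heads, 8 KV heads, head_dim 128.
SEQ_LEN = 2**20            # roughly 1M tokens
N_LAYERS = 32
N_SPARSE = N_LAYERS // 4   # 1 part sparse attention
N_LINEAR = N_LAYERS - N_SPARSE  # 3 parts linear attention

full = kv_cache_bytes(SEQ_LEN, N_LAYERS, n_kv_heads=8, head_dim=128)
hybrid = hybrid_cache_bytes(SEQ_LEN, N_SPARSE, N_LINEAR,
                            n_kv_heads=8, n_heads=32, head_dim=128)

GIB = 2**30
print(f"full attention cache: {full / GIB:.2f} GiB")
print(f"hybrid cache:         {hybrid / GIB:.2f} GiB")
```

Under these assumptions the per-token cache shrinks roughly in proportion to the share of layers that keep one, which is consistent with the abstract's claim that 1M-token contexts become feasible on a single GPU where a full-attention model of similar size runs out of memory. The actual figures depend on the real model configuration and on how InfLLM-V2 manages its cache.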
