2603.15619v1 Mar 16, 2026 cs.CL

Mixture-of-Depths Attention

Yujia Liu

Citations: 0

h-index: 0

Shijie Wang

Alibaba DAMO Academy

Citations: 8,131

h-index: 6

Lianghui Zhu

Citations: 2,368

h-index: 8

Bencheng Liao

Citations: 4,617

h-index: 15

Tianheng Cheng

Citations: 12,165

h-index: 22

Zilong Huang

Citations: 8

h-index: 2

Yi Lin

Citations: 56

h-index: 2

Xinggang Wang

Citations: 2,871

h-index: 11

Yuxin Fang

Citations: 1,585

h-index: 6

Chen Chen

Citations: 28

h-index: 4

Yutao Zeng

Bytedance

Citations: 679

h-index: 12

Ya Wang

Citations: 60

h-index: 3

Yu Li

Citations: 1

h-index: 1

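Deep architectures play an important role in improving the performance of large language models (LLMs). As LLMs grow deeper, however, signal degradation occurs: useful features formed in shallow layers are progressively diluted by repeated residual updates and become hard to recover in deeper layers. This paper proposes mixture-of-depths attention (MoDA), a mechanism that lets each attention head access depth KV pairs from preceding layers together with the sequence KV pairs of the current layer. It also presents a hardware-efficient implementation algorithm for MoDA that resolves non-contiguous memory-access patterns and reaches 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. In experiments on 1.5B-parameter models, MoDA consistently outperformed strong baselines. In particular, it improved average perplexity by 0.2 across 10 validation benchmarks and raised average performance by 2.11% on 10 downstream tasks, while increasing computation (FLOPs) by only 3.7%. The authors also found that MoDA performs better with post-norm than with pre-norm. These results suggest that MoDA is a promising technique for depth scaling. Code and related materials are available at https://github.com/hustvl/MoDA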

Original Abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .
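
The core idea in the abstract, one attention head attending jointly to sequence KV pairs at the current layer and depth KV pairs from preceding layers, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name moda_head_sketch is hypothetical, and the sketch assumes that the depth KV pairs are the keys and values of the same token taken from earlier layers and that a single softmax is shared across both sets. The abstract does not confirm either assumption, and the hardware-efficient kernel it mentions is not modeled here.

```python
import numpy as np

def moda_head_sketch(q, k_seq, v_seq, k_depth, v_depth):
    """Illustrative single-head, single-query attention over two KV sources.

    q:       (d,)    query at the current layer and position
    k_seq:   (t, d)  keys of the causally visible tokens at the current layer
    v_seq:   (t, d)  values matching k_seq
    k_depth: (m, d)  ASSUMPTION: keys for this token from m preceding layers
    v_depth: (m, d)  values matching k_depth
    """
    d = q.shape[-1]
    # Put sequence and depth entries into one key/value set.
    keys = np.concatenate([k_seq, k_depth], axis=0)
    values = np.concatenate([v_seq, v_depth], axis=0)
    # Scaled dot-product scores, then a numerically stable softmax.
    # ASSUMPTION: one softmax normalizes over both sources jointly.
    scores = keys @ q / np.sqrt(d)
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ values, weights

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
d, t, m = 8, 5, 3
q = rng.normal(size=d)
k_seq, v_seq = rng.normal(size=(t, d)), rng.normal(size=(t, d))
k_depth, v_depth = rng.normal(size=(m, d)), rng.normal(size=(m, d))

out, w = moda_head_sketch(q, k_seq, v_seq, k_depth, v_depth)
print(out.shape, w[:t].sum(), w[t:].sum())  # mass on sequence vs. depth entries
```

With an empty depth set (m = 0) the sketch reduces to ordinary single-query softmax attention, which is one way to read the abstract's framing of MoDA as an extension of standard attention rather than a replacement.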

