2601.19611v1 Jan 27, 2026 cs.LG

Explicit Multi-head Attention for Inter-head Interaction in Large Language Models

Yunhua Zhou (Citations: 1,046; h-index: 15)
Xipeng Qiu (Citations: 3; h-index: 1)
Qipeng Guo (Citations: 544; h-index: 13)
Runyu Peng (Citations: 100; h-index: 4)
Demin Song (Citations: 835; h-index: 10)
Kai Lv (Citations: 921; h-index: 9)
Bo Wang (Citations: 92; h-index: 5)

Original Abstract

In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them using low-rank "virtual heads". This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop for Olympiad-level mathematical benchmarks.
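
The two components named in the abstract, and the cache arithmetic behind the 50% figure, can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration rather than the paper's formulation: the function names, the toy mixing matrix, normalizing each head as its own group without learned scale or shift, applying the normalization directly to the recombined keys, and the model dimensions used in the memory estimate.

```python
import math

def hlc(heads, weights):
    """Mix per-head vectors: out[i] = sum_j weights[i][j] * heads[j]."""
    dim = len(heads[0])
    return [
        [sum(w * head[k] for w, head in zip(row, heads)) for k in range(dim)]
        for row in weights
    ]

def head_group_norm(heads, eps=1e-5):
    """Normalize each head independently to zero mean and unit variance.

    Assumption: one group per head, no learned affine parameters.
    """
    normed = []
    for head in heads:
        mean = sum(head) / len(head)
        var = sum((x - mean) ** 2 for x in head) / len(head)
        normed.append([(x - mean) / math.sqrt(var + eps) for x in head])
    return normed

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys plus values stored for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Two cached key heads for a single token (head_dim = 4); toy values.
cached_keys = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
# Hypothetical 4x2 mixing matrix: four "virtual heads" rebuilt from two
# stored heads, so the rebuilt set has rank at most 2 across heads.
mix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]

virtual_keys = head_group_norm(hlc(cached_keys, mix))

# Hypothetical model dimensions; storing half the heads halves the cache.
full = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
half = kv_cache_bytes(layers=32, kv_heads=4, head_dim=128, seq_len=4096)
print(len(virtual_keys), half / full)  # prints: 4 0.5
```

In this sketch only the two stored heads would need to be cached; the four virtual heads are recomputed from them at attention time, which is where the memory saving comes from. The same mixing would apply to the value vectors with a separate matrix, since the abstract states that keys and values are combined separately.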

0 Citations

0 Influential

7.5 Altmetric

37.5 Score
