2603.00563v1 Feb 28, 2026 cs.SD

Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

Jianguo Wei (Citations: 57, h-index: 4)

Senyao Zhang (Citations: 3, h-index: 1)

Wenhuan Lu (Citations: 18, h-index: 2)

Xianghu Yue (Citations: 893, h-index: 11)

Wei Li (Citations: 942, h-index: 10)

Qiang Li (Citations: 13, h-index: 2)

Pengcheng Zhao (Citations: 4, h-index: 1)

Minghong Cai (Citations: 72, h-index: 2)

Luo Si (Citations: 109, h-index: 2)

Original Abstract

The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is problematic for many applications especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
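To make the memory claim concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes a simplified cost model: standard MHA caches one key and one value vector of width `d_model` per token per layer, while a latent-attention layer caches a single compressed vector of width `d_latent`. The function names and the example dimensions (12 layers, `d_model = 768`, `d_latent = 192`, 448 tokens, 2 bytes per value) are illustrative assumptions, chosen only so that the ratio matches the 87.5% figure quoted in the abstract. Any additional terms the actual method may cache are ignored.

```python
def mha_kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_value=2):
    """Bytes cached by standard MHA under the simplified model:
    one key and one value vector of width d_model per token per layer."""
    return 2 * n_layers * d_model * seq_len * bytes_per_value


def latent_kv_cache_bytes(n_layers, d_latent, seq_len, bytes_per_value=2):
    """Bytes cached when each token stores a single latent vector
    of width d_latent per layer (assumed simplification of MLA)."""
    return n_layers * d_latent * seq_len * bytes_per_value


def cache_reduction(n_layers, d_model, d_latent, seq_len):
    """Fraction of KV cache memory saved by caching latents instead of K and V."""
    full = mha_kv_cache_bytes(n_layers, d_model, seq_len)
    compressed = latent_kv_cache_bytes(n_layers, d_latent, seq_len)
    return 1 - compressed / full


if __name__ == "__main__":
    # Hypothetical dimensions, chosen for illustration only.
    layers, width, latent = 12, 768, 192
    for tokens in (112, 224, 448):
        full = mha_kv_cache_bytes(layers, width, tokens)
        small = latent_kv_cache_bytes(layers, latent, tokens)
        print(f"{tokens:4d} tokens: MHA {full:>10,d} B, latent {small:>10,d} B")
    print(f"reduction: {cache_reduction(layers, width, latent, 448):.1%}")
```

Under this model both caches grow linearly with the number of cached tokens, so the saving is a constant fraction set by `d_latent / (2 * d_model)`. A latent width of one quarter of the model width gives a ratio of 1/8, which corresponds to the 87.5% reduction.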

0 Citations

0 Influential

5.5 Altmetric

27.5 Score
