2601.19399v1 Jan 27, 2026 cs.SD

Residual Tokens Enhance Masked Autoencoders for Speech Modeling

Xavier Alameda-Pineda

Samir Sadok

Stéphane Lathuilière

Abstract

Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, and emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
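
The abstract describes the approach only at a high level. As a rough illustration, and not the authors' implementation, the sketch below shows one way a small set of trainable residual tokens could be appended to explicit attribute tokens before masking. The function names (`make_residual_tokens`, `build_input_sequence`, `random_mask`), the token dimensions, the mask ratio, and the choice to mask only attribute tokens are all assumptions made for illustration; the abstract specifies none of them.

```python
import random


def make_residual_tokens(num_tokens, dim, seed=0):
    """Hypothetical: initialise residual tokens as small random vectors.

    In a real model these would be trainable parameters; here they are
    plain lists so the sketch stays dependency-free.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]


def build_input_sequence(attribute_tokens, residual_tokens):
    """Concatenate attribute tokens and residual tokens, tagging each with its source."""
    sequence = [("attribute", tok) for tok in attribute_tokens]
    sequence += [("residual", tok) for tok in residual_tokens]
    return sequence


def random_mask(sequence, mask_ratio, seed=0):
    """Mask a random subset of attribute tokens by replacing them with None.

    Assumption: residual tokens stay visible. The abstract does not say
    how masking interacts with the residual tokens.
    """
    rng = random.Random(seed)
    attr_idx = [i for i, (kind, _) in enumerate(sequence) if kind == "attribute"]
    n_mask = int(round(mask_ratio * len(attr_idx)))
    masked = set(rng.sample(attr_idx, n_mask))
    masked_seq = [
        (kind, None if i in masked else tok) for i, (kind, tok) in enumerate(sequence)
    ]
    return masked_seq, masked


# Toy usage: 6 attribute tokens (e.g. pitch/content/speaker features) plus 2 residual tokens.
attribute_tokens = [[float(i)] * 4 for i in range(6)]
residual_tokens = make_residual_tokens(num_tokens=2, dim=4)
sequence = build_input_sequence(attribute_tokens, residual_tokens)
masked_sequence, masked_positions = random_mask(sequence, mask_ratio=0.5)
```

In this toy setup a decoder would be trained to reconstruct the masked attribute positions, with the residual tokens free to absorb whatever the labeled attributes leave unexplained. How the actual RT-MAE model trains and regularises those tokens is described in the paper itself, not here.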
