2605.14231v1 May 14, 2026 cs.LG

AudioMosaic: Contrastive Masked Audio Representation Learning

Hanxun Huang (The University of Melbourne), citations: 1,305, h-index: 13

Cihang Xie, citations: 392, h-index: 8

Qizhou Wang, citations: 71, h-index: 3

Christopher Leckie, citations: 12, h-index: 2

Xingjun Ma, citations: 142, h-index: 6

Sarah M. Erfani, citations: 174, h-index: 5

Abstract

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available in our GitHub repository: https://github.com/HanxunH/AudioMosaic
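The positive-pair construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the masking scheme, so the single time band and single frequency band used here are assumptions, as are the hypothetical names `structured_mask`, `encode_visible` (a mean-pooling stand-in for a real encoder), and `info_nce`, and the temperature of 0.1. The sketch shows two independently masked views of the same clip being encoded from their visible patches only and scored with an InfoNCE-style contrastive loss.

```python
import math
import random


def structured_mask(n_time, n_freq, time_band, freq_band, rng):
    """Return indices of visible patches on an n_time x n_freq patch grid.

    Assumption: one contiguous band of time steps and one contiguous band of
    frequency bins are hidden; the paper's exact scheme may differ.
    """
    t0 = rng.randrange(n_time - time_band + 1)
    f0 = rng.randrange(n_freq - freq_band + 1)
    visible = []
    for t in range(n_time):
        for f in range(n_freq):
            hidden = (t0 <= t < t0 + time_band) or (f0 <= f < f0 + freq_band)
            if not hidden:
                visible.append(t * n_freq + f)
    return visible


def encode_visible(patches, visible):
    """Hypothetical stand-in encoder: mean-pool the visible patch vectors."""
    dim = len(patches[0])
    return [sum(patches[i][d] for i in visible) / len(visible) for d in range(dim)]


def info_nce(za, zb, temperature=0.1):
    """Mean cross-entropy of matching za[i] to zb[i] against every zb[j]."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    za, zb = [unit(v) for v in za], [unit(v) for v in zb]
    total = 0.0
    for i, a in enumerate(za):
        logits = [sum(x * y for x, y in zip(a, b)) / temperature for b in zb]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(za)


if __name__ == "__main__":
    rng = random.Random(0)
    n_time, n_freq, dim, batch = 8, 4, 16, 4
    views_a, views_b = [], []
    for _ in range(batch):
        # Random vectors stand in for the patch embeddings of one audio clip.
        patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_time * n_freq)]
        views_a.append(encode_visible(patches, structured_mask(n_time, n_freq, 3, 1, rng)))
        views_b.append(encode_visible(patches, structured_mask(n_time, n_freq, 3, 1, rng)))
    print("InfoNCE loss:", info_nce(views_a, views_b))
```

With these illustrative settings, hiding 3 of 8 time steps and 1 of 4 frequency bins leaves 15 of 32 patches visible, so an encoder that only processes visible patches handles fewer than half of the tokens per view. That is the kind of saving that would make larger contrastive batches affordable.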
