2601.12594v1 Jan 18, 2026 eess.AS

SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

Xinhao Mei (Citations: 2,208; h-index: 15)

Gaël Le Lan (Citations: 302; h-index: 9)

Haohe Liu (Citations: 3,510; h-index: 26)

Zhaoheng Ni (Citations: 150; h-index: 7)

Varun Nagaraja (Citations: 41; h-index: 4)

Yang Liu (Citations: 314; h-index: 4)

Yangyang Shi (Citations: 538; h-index: 9)

Vikas Chandra (Citations: 65; h-index: 2)

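Contrastive language-audio pretraining (CLAP) has been highly successful at learning semantically rich audio representations and is widely used across audio-related tasks. However, current CLAP models have several important limitations. First, they are typically trained on relatively small datasets of a few million audio samples. Second, existing CLAP models are limited to short, fixed-length audio, which restricts their use in real-world settings where audio length varies. Third, the standard contrastive objective operates on global representations, which can hinder the learning of dense, fine-grained audio features. To address these problems, we propose Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining by incorporating variable-duration audio and multiple training objectives. SLAP combines the contrastive loss with additional self-supervised and captioning losses in single-stage training, enabling richer and denser audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification, demonstrating its effectiveness across a range of benchmarks.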

Original Abstract

Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short and fixed duration, which constrains their usage in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single-stage training, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
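
To make the "contrastive objective on global representations" concrete, the sketch below shows a generic symmetric contrastive (InfoNCE-style) loss over paired audio and text embeddings, plus a weighted sum with two extra loss terms. It is an illustration only, not the SLAP implementation: the abstract does not give the temperature, the loss weights, or the form of the self-supervised and captioning losses, so `temperature`, `w_ssl`, `w_cap`, and all function names here are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired global embeddings.

    Row i of audio_embs is assumed to be paired with row i of text_embs.
    The temperature value is an assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: each audio clip should pick out its own caption.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


def total_loss(l_contrastive, l_ssl, l_caption, w_ssl=1.0, w_cap=1.0):
    """Hypothetical single-stage objective: a weighted sum of the three terms."""
    return l_contrastive + w_ssl * l_ssl + w_cap * l_caption


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", contrastive_loss(audio, matched_text))
    print("swapped:", contrastive_loss(audio, swapped_text))
```

Because the loss sees only one pooled vector per clip and per caption, it rewards correct pairing at the clip level. That is the limitation the abstract says the additional self-supervised and captioning losses are meant to address.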

0 Citations

0 Influential

13 Altmetric

65.0 Score
