2602.16008v1 Feb 17, 2026 cs.SD

MAEB: Massive Audio Embedding Benchmark

Adnan El Assadi

Citations: 1

h-index: 1

Isaac Chung

Citations: 128

h-index: 4

Chenghao Xiao

Citations: 188

h-index: 6

Roman Solomatin

Citations: 22

h-index: 3

Animesh Jha

Citations: 72

h-index: 4

Rahul Chand

Citations: 18

h-index: 3

Silky Singh

Citations: 5

h-index: 1

Kaitlyn Wang

Citations: 0

h-index: 0

A. S. Khan

Citations: 11

h-index: 2

M. Nasser

Citations: 213

h-index: 5

Sufen Fong

Citations: 0

h-index: 0

Pengfei He

Citations: 66

h-index: 2

Alan Xiao

Citations: 0

h-index: 0

A. Munot

Citations: 0

h-index: 0

A. Shrivastava

Citations: 61

h-index: 3

Niklas Muennighoff

Citations: 28

h-index: 3

Kenneth C. Enevoldsen

Citations: 197

h-index: 5

A. Gazizov

Citations: 3

h-index: 1

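This paper introduces the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark of 30 tasks spanning speech, music, environmental sounds, and cross-modal (audio-text) reasoning in more than 100 languages. An evaluation of more than 50 models shows that no single model performs well on every task. Contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but stay at chance level on multilingual speech tasks (e.g., SIB-FLEURS), whereas speech-pretrained models show the opposite trend. Clustering remains difficult for all models, and even the best-performing model achieves only modest results. Models with strong acoustic understanding tend to perform poorly on linguistic tasks, and models with strong linguistic ability tend to perform poorly on acoustic tasks. The authors also find that an audio encoder's performance on MAEB correlates highly with its performance when used in an audio large language model. MAEB is derived from MAEB+, a collection of 98 tasks; it preserves task diversity while reducing evaluation cost, and it is integrated into the MTEB ecosystem to enable unified evaluation across text, image, and audio modalities. MAEB, all 98 tasks, the code, and a leaderboard are released at https://github.com/embeddings-benchmark/mteb.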

Original Abstract

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
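
The abstract reports that encoder performance on MAEB "correlates highly" with performance inside audio large language models, but it does not say which statistic was used. As a minimal sketch, assuming a rank correlation (Spearman) is an acceptable way to check such a claim, the following computes it in plain Python. The function names `average_ranks`, `pearson`, and `spearman` are illustrative and are not part of MTEB or the paper's code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(benchmark_scores, downstream_scores):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(benchmark_scores) != len(downstream_scores):
        raise ValueError("inputs must have the same length")
    if len(benchmark_scores) < 2:
        raise ValueError("need at least two paired scores")
    return pearson(average_ranks(benchmark_scores),
                   average_ranks(downstream_scores))
```

To use it, pass one benchmark score per encoder and the matching downstream score for the same encoder, in the same order. A value near 1 means the benchmark ranks encoders the same way the downstream setting does.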
