2605.03361v1 May 05, 2026 cs.AI

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

Yuting Chen (Citations: 55, h-index: 3)
Siyue Zhang (Citations: 95, h-index: 5)
Yilei Shi (Citations: 26, h-index: 3)
Honglei Zhang (Citations: 0, h-index: 0)
Chenpeng Hu (Citations: 1, h-index: 1)

Abstract

As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Although these tasks are intuitive for humans and straightforward to construct, they pose significant challenges to current models. Our evaluation of ten state-of-the-art models yields two findings. First, all models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Second, Multimodal Large Language Model-based embedding models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings.
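To make the evaluation setting concrete, here is a minimal sketch of how an embedding-based text-audio retriever could be scored per reasoning task. It assumes cosine similarity between query and clip embeddings and a Recall@k metric; the abstract does not state which similarity or metric ReasonAudio uses. The function name `recall_at_k`, the clip identifiers, and every embedding value are hypothetical toy data, not benchmark data or results.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, clips, k):
    """Per-task Recall@k (an assumed metric, not taken from the paper).

    queries: list of dicts with keys 'task', 'embedding', 'target' (a clip id).
    clips:   dict mapping clip id -> embedding.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for q in queries:
        # Rank all clip ids by similarity to the query, best first.
        ranked = sorted(
            clips,
            key=lambda cid: cosine(q["embedding"], clips[cid]),
            reverse=True,
        )
        totals[q["task"]] += 1
        if q["target"] in ranked[:k]:
            hits[q["task"]] += 1
    return {task: hits[task] / totals[task] for task in totals}


# Toy 2-D embeddings (hypothetical values chosen only for illustration).
clips = {
    "dog_then_bell": [1.0, 0.0],
    "bell_then_dog": [0.9, 0.1],
    "rain_only": [0.0, 1.0],
}
queries = [
    # An Order-style query whose embedding lands nearest its target clip.
    {"task": "Order", "embedding": [0.8, 0.2], "target": "bell_then_dog"},
    # A Negation-style query ("rain, with no dog") whose embedding is pulled
    # toward the dog clips, so the correct clip is ranked last.
    {"task": "Negation", "embedding": [0.8, 0.6], "target": "rain_only"},
]

scores = recall_at_k(queries, clips, k=1)
print(scores)
```

Reporting the score per task rather than as a single average is what lets a benchmark of this kind show that some reasoning types (such as Negation) are harder than others; in the toy data the Negation query is deliberately constructed to miss at k=1.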
