2604.20267v1 Apr 22, 2026 cs.SD

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

Yutao Zhu

University of Montreal

Zhicheng Dou

Tongtao Zhao

Chenghao Zhang

Original Abstract

Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and train our ATIR model based on a Multimodal Large Language Model (MLLM). We further introduce a novel token compression mechanism that is orthogonal to existing compression methods, thereby alleviating the issue of excessive audio tokens in MLLM-based ATIR models. Experimental results demonstrate that our ATIR model achieves substantial improvements over strong baselines.
