2602.07036v1 Feb 03, 2026 cs.SD

MENASpeechBank: A Speech Database of Persona-Based Multi-Turn Conversations, Reference Data for AudioLLMs

MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs

Zien Sheikh Ali

Hunzalah Hassan Bhatti

R. N. Nandi

S. Chowdhury

Firoj Alam

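Audio large language models (AudioLLMs) can follow instructions over speech and general audio, but their progress is increasingly slowed by a shortage of diverse, conversational, instruction-aligned speech-text data. Collecting and releasing real multi-speaker recordings is costly and slow, particularly for persona-based interactions and dialect coverage. This work introduces MENASpeechBank, a reference speech database of about 18,000 high-quality utterances from 124 speakers, covering English, Modern Standard Arabic (MSA), and regional Arabic dialects. Building on this resource, the authors develop a controllable synthetic data pipeline that: (i) builds persona profiles enriched with attributes inspired by the World Values Survey, (ii) defines a taxonomy of about 5,000 conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417,000 role-play conversations in which the user plays the persona and the assistant acts as a helpful agent, and (v) synthesizes the user utterances from reference speaker audio to preserve speaker identity and diversity. Both the synthetic and the human-recorded data are evaluated, with detailed analysis provided. MENASpeechBank and the generated conversations will be released to the community.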

Original Abstract

Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.
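Step (iii) of the pipeline, matching personas to scenarios via semantic similarity, can be illustrated with a minimal sketch. The abstract does not say which embedding model or similarity measure the authors use, so everything below is an assumption: `embed` is a toy bag-of-words stand-in for a real sentence encoder, cosine similarity is one common choice of measure, and the persona and scenario strings are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts.

    Assumption: a real pipeline would call a learned sentence encoder here.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def match_scenarios(persona, scenarios, top_k=1):
    """Return the top_k scenarios ranked by similarity to the persona text."""
    persona_vec = embed(persona)
    ranked = sorted(
        scenarios,
        key=lambda s: cosine(persona_vec, embed(s)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical persona and scenarios, not taken from the dataset.
PERSONA = "teacher in Cairo who enjoys cooking traditional food"
SCENARIOS = [
    "asking for a recipe for traditional food",
    "booking a flight ticket to Doha",
    "troubleshooting a laptop battery",
]

if __name__ == "__main__":
    print(match_scenarios(PERSONA, SCENARIOS, top_k=1))
```

With a learned encoder in place of `embed`, the same ranking loop would pair each persona with the scenarios closest to it in embedding space, which is one plausible reading of the matching step described above.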