2603.28086v1 Mar 30, 2026 cs.SD

MOSS-VoiceGenerator: Realistic Voice Generation Using Natural Language Descriptions

MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions

Chenchen Yang

Liwei Fan

Zhaoye Fei

Qinyuan Cheng

Fudan University

Shimin Li

Qian Tu

Kexin Huang

Botian Jiang

Ya Jiang

Yiwei Zhao

Xiaogui Yang

Xipeng Qiu

Jie Zhu

Yuqian Zhang

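Natural-language voice design aims to generate speaker timbres directly from free-form text descriptions, so that users can create voices tailored to specific roles, personalities, and emotions. Such controllable voice generation is useful for a range of applications, including storytelling, game dubbing, role-play agents, and conversational assistants, and is an important task for modern text-to-speech models. However, most existing models are trained on carefully recorded studio data; they produce clean, clearly articulated speech that lacks the naturalness of real human voices. To overcome this limitation, we present MOSS-VoiceGenerator, an open-source voice generation model that creates new timbres directly from natural language prompts. Following the hypothesis that exposure to real-world acoustic variation yields more natural voices, we trained the model on large-scale expressive speech data collected from cinematic content. In subjective preference studies, MOSS-VoiceGenerator outperformed other voice design models in overall performance, instruction following, and naturalness.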

Original Abstract

Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
