2605.05611v1 May 07, 2026 cs.SD

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Qinyuan Cheng

Fudan University

Citations: 868

h-index: 17

Xipeng Qiu

Citations: 16

h-index: 3

Yushen Chen

Citations: 477

h-index: 4

Kai Yu

Citations: 414

h-index: 9

Zhikang Niu

Citations: 831

h-index: 10

Xie Chen

Citations: 16

h-index: 2

Qingyun Liu

Citations: 43

h-index: 2

Haitao Li

Citations: 63

h-index: 2

Yunting Yang

Citations: 0

h-index: 0

Jianxuan Zhao

Citations: 1

h-index: 1

Berrak Sisman

Citations: 2,477

h-index: 26

Rixin Xu

Citations: 3

h-index: 1

Ke Li

Citations: 64

h-index: 2

Original Abstract

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
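As a rough illustration of the two ingredients the abstract names, the sketch below builds one training pair for conditional flow matching and shows one way prompt text could be masked for Stage 2. It assumes the linear (optimal-transport) probability path commonly used in flow-matching TTS; the abstract does not state the exact path or how the masking is implemented. The helper names `cfm_training_pair`, `mse`, and `mask_prompt_text`, and the `<mask>` token, are hypothetical and are not taken from the X-Voice code.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Build one flow-matching training pair for a data sample x1 at time t.

    Assumes a linear path from noise x0 to data x1:
        x_t = (1 - t) * x0 + t * x1,  target velocity = x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]          # Gaussian noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]        # velocity the model regresses
    return x_t, target


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def mask_prompt_text(prompt_tokens, target_tokens, mask_token="<mask>"):
    """Hypothetical Stage-2 text input: hide the prompt transcript and keep
    only the text to be synthesized (an assumption, not the paper's code)."""
    return [mask_token] * len(prompt_tokens) + list(target_tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    x_t, v = cfm_training_pair([1.0, 2.0, 3.0], t=0.25, rng=rng)
    print(len(x_t), len(v))
    print(mask_prompt_text(["h", "ə"], ["b", "a", "ɪ"]))
```

In a real system the model would predict the velocity from `x_t`, `t`, the audio prompt, and the (masked) text, and `mse` would be computed only over the region to be generated; those details are omitted here.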

0 Citations

0 Influential

13 Altmetric

65.0 Score
