2602.07803v1 Feb 08, 2026 eess.AS

SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

J. Qian (Citations: 5, h-index: 1)
Hao Meng (Citations: 2, h-index: 1)
Tian Zheng (Citations: 3, h-index: 1)
Pengcheng Zhu (Citations: 7, h-index: 1)
Haopeng Lin (Citations: 6, h-index: 1)
Yuhang Dai (Citations: 102, h-index: 5)
Hanke Xie (Citations: 28, h-index: 3)
Wenxiao Cao (Citations: 6, h-index: 1)
Ruixuan Shang (Citations: 5, h-index: 1)
Jun Wu (Citations: 118, h-index: 4)
Hongmei Liu (Citations: 5, h-index: 1)
Hanlin Wen (Citations: 83, h-index: 4)
Jian Zhao (Citations: 397, h-index: 3)
Zhonglin Jiang (Citations: 38, h-index: 3)
Yong Chen (Citations: 12, h-index: 2)
Shunshun Yin (Citations: 15, h-index: 3)
Ming Tao (Citations: 13, h-index: 2)
Jianguo Wei (Citations: 57, h-index: 4)
Lei Xie (Citations: 39, h-index: 3)
Xinsheng Wang (Citations: 15, h-index: 3)

Abstract

While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.

1 Citation

0 Influential

2.5 Altmetric

13.5 Score
