2604.05526v1 Apr 07, 2026 cs.SD

Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

Zhetao Hu (Citations: 1, h-index: 1)

Yiquan Zhou (Citations: 17, h-index: 2)

Wenyu Wang (Citations: 16, h-index: 2)

Zhiyue Wu (Citations: 4, h-index: 1)

Xin Gao (Citations: 68, h-index: 4)

Jihua Zhu (Citations: 3,102, h-index: 28)

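This paper introduces a new singing style conversion system that the S4 team submitted to the Singing Voice Conversion Challenge 2025 (SVCC2025). The system aims to improve fine-grained style conversion and control within in-domain settings. To address the key challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three main innovations. First, a boundary-aware Whisper bottleneck that uses phoneme-span representations to suppress residual source style while preserving linguistic content. Second, an explicit frame-level technique matrix, combined with targeted F0 processing at inference time, for stable and distinct dynamic style rendering. Third, a perceptually motivated high-frequency band completion strategy that uses an auxiliary standard 48 kHz SVC model to augment the high-frequency spectrum, overcoming data scarcity while avoiding overfitting. In the official SVCC2025 subjective evaluation, our system achieved the best naturalness among all submitted systems and remained competitive in speaker similarity and technique control, despite using far less additional singing data than the other top-performing systems. Audio samples are available online.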

Original Abstract

This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48 kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
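
The abstract says the bottleneck pools phoneme-span representations, but gives no further detail. The sketch below is one possible reading and should not be taken as the authors' implementation. It assumes mean pooling over each phoneme span and copies the pooled vector back to every frame in that span. The function name `pool_phoneme_spans`, the `[start, end)` span convention, and the toy feature values are all hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def pool_phoneme_spans(
    frames: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Frame]:
    """Mean-pool frame features inside each [start, end) phoneme span and
    broadcast the pooled vector back to every frame of that span.

    Assumption: mean pooling and broadcasting are illustrative choices;
    the abstract only states that phoneme-span representations are pooled.
    """
    # Frames not covered by any span are copied through unchanged.
    pooled: List[Frame] = [list(f) for f in frames]
    for start, end in spans:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid span ({start}, {end})")
        dim = len(frames[start])
        length = end - start
        # Average each feature dimension over the frames of this span.
        mean = [
            sum(frames[t][d] for t in range(start, end)) / length
            for d in range(dim)
        ]
        for t in range(start, end):
            pooled[t] = list(mean)
    return pooled


# Toy input: 5 frames of 2-dimensional features and two phoneme spans.
frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [4.0, 4.0], [6.0, 4.0]]
spans = [(0, 2), (2, 5)]
pooled = pool_phoneme_spans(frames, spans)
```

Under these assumptions, every frame within a phoneme carries the same vector, so frame-to-frame variation inside the span is averaged away while the phoneme-level content and its boundaries remain. This is meant only to illustrate the kind of effect the abstract attributes to the bottleneck.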

Citations: 1 · Influential: 0 · Altmetric: 14 · Score: 71.0
