2605.04613v1 May 06, 2026 cs.SD

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

E. Chng (Citations: 2,139; h-index: 24)

Tianrui Wang (Citations: 190; h-index: 3)

Yukun Chen (Citations: 7; h-index: 2)

Zhaoxi Mu (Citations: 77; h-index: 5)

Xinyu Yang (Citations: 32; h-index: 2)

Original Abstract

High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain-of-Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state-of-the-art SVT performance on multiple singing datasets. The source code and checkpoint are available at https://github.com/pymaster17/VocalParse.
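To make the interleaved, lyrics-first idea concrete, the following is a minimal Python sketch of how such an output could be turned back into word-note pairs. The abstract does not specify the token format, so everything here is an assumption made for illustration: the `<P..>` (MIDI pitch) and `<D..>` (duration in seconds) tokens, the `<score>` separator, and the helper names `parse_interleaved` and `split_lyrics_first` are hypothetical and are not taken from the VocalParse code base.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token shapes (assumptions, not the paper's actual vocabulary):
#   <P60>    MIDI pitch 60
#   <D0.50>  note duration in seconds
PITCH = re.compile(r"<P(\d+)>")
DURATION = re.compile(r"<D(\d+(?:\.\d+)?)>")


@dataclass
class Note:
    pitch: int
    duration: float


@dataclass
class Word:
    text: str
    notes: list = field(default_factory=list)


def parse_interleaved(sequence: str) -> list:
    """Attach each pitch/duration pair to the word that precedes it."""
    words = []
    pending_pitch = None
    for token in sequence.split():
        pitch_match = PITCH.fullmatch(token)
        duration_match = DURATION.fullmatch(token)
        if pitch_match:
            pending_pitch = int(pitch_match.group(1))
        elif duration_match:
            if pending_pitch is None or not words:
                raise ValueError(f"duration without a pitch or word: {token}")
            duration = float(duration_match.group(1))
            words[-1].notes.append(Note(pending_pitch, duration))
            pending_pitch = None
        else:
            # Any other token is treated as a lyric word.
            words.append(Word(token))
    return words


def split_lyrics_first(output: str):
    """Split a lyrics-first output into (lyric words, interleaved score)."""
    lyrics, separator, score = output.partition("<score>")
    if not separator:
        raise ValueError("missing <score> separator")
    return lyrics.split(), score.strip()


# Lyrics are decoded first, then the interleaved word-note sequence.
output = "shine on <score> shine <P60> <D0.50> <P62> <D0.25> on <P64> <D1.00>"
lyrics, score = split_lyrics_first(output)
words = parse_interleaved(score)

# The interleaved section should reproduce the lyrics decoded up front.
assert [w.text for w in words] == lyrics
```

In this sketch a word followed by several pitch/duration pairs represents one word sung over multiple notes, which is the kind of word-note correspondence the abstract describes. Checking the interleaved section against the lyrics decoded first is one simple way such a format could be validated.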
