2601.14777v1 Jan 21, 2026 cs.CV

FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes

Jiaxuan Liu (Citations: 13, h-index: 2)
Yang Xiang (Citations: 29, h-index: 3)
Han Zhao (Citations: 13, h-index: 3)
Xiangang Li (Citations: 196, h-index: 7)
Zhenhua Ling (Citations: 4, h-index: 2)

Original Abstract

Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.

2 Citations

1 Influential

3.5 Altmetric

21.5 Score
