2601.17737v2 Jan 25, 2026 cs.CV

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

F. Ye (Citations: 81, h-index: 5)

Bo Zhao (Citations: 6, h-index: 2)

Ruotian Ma (Citations: 99, h-index: 5)

Zhaopeng Tu (Citations: 135, h-index: 6)

Xiaolong Li (Citations: 132, h-index: 6)

Linus (Citations: 617, h-index: 11)

Chenyu Mu (Citations: 26, h-index: 2)

Xin He (Citations: 117, h-index: 6)

Qu Yang (Citations: 11, h-index: 2)

Wanshun Chen (Citations: 2, h-index: 1)

Jiadi Yao (Citations: 2, h-index: 1)

Huang Liu (Citations: 32, h-index: 2)

Zihao Yi (Citations: 254, h-index: 4)

Xingyu Chen (Citations: 47, h-index: 3)

Erkun Yang (Citations: 2,511, h-index: 21)

Cheng Deng (Citations: 6, h-index: 2)

Original Abstract

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2 Citations

1 Influential

10.5 Altmetric

56.5 Score
