2604.04875v1 Apr 06, 2026 cs.CV

DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

Keqian Li

Citations: 32,377

h-index: 9

Jialiang Chen

Citations: 39

h-index: 4

Jiayu Chen

Citations: 41

h-index: 4

Zihao Zheng

Citations: 34

h-index: 3

Shaoqi Wang

Citations: 3

h-index: 1

Maoliang Li

Citations: 46

h-index: 4

Xiang Chen

Citations: 49

h-index: 4

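Video mashup creation is a complex form of video editing that recomposes existing footage into engaging audio-visual experiences, and it requires careful coordination across semantic, visual, and auditory dimensions at multiple levels. Existing automated editing frameworks, however, often overlook the cross-level, multimodal coordination needed for professional-grade fluidity, which leads to disjointed sequences, abrupt visual transitions, and mismatches with the music. To address this, the paper formulates video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and proposes the DIRECT framework. The hierarchical multi-agent framework mimics a professional production pipeline and decomposes the problem into three levels: the Screenwriter sets the overall structure with the source footage in mind, the Director provides adaptive editing intent and guidance, and the Editor edits the shot sequence based on that intent and performs fine-grained optimization. The paper also introduces Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments show that DIRECT substantially outperforms state-of-the-art baselines on both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT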

Original Abstract

Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
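The three cascade levels described in the abstract can be pictured as a simple function pipeline. The sketch below is only an illustration of that control flow and is not the DIRECT implementation: the abstract does not say how each level works internally, so every name here (`Shot`, `Section`, `Intent`, `screenwriter`, `director`, `editor`, `make_mashup`), the tag-matching rule, and the pacing values are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    shot_id: str
    tag: str          # coarse semantic label (hypothetical), e.g. "action"
    duration: float   # seconds


@dataclass(frozen=True)
class Section:
    start: float      # seconds into the music track
    end: float
    theme: str        # semantic theme assigned to this span


@dataclass(frozen=True)
class Intent:
    section: Section
    max_shot_len: float   # pacing hint: smaller means faster cuts


def screenwriter(boundaries, themes):
    """Global structure: one themed section per pair of adjacent boundaries."""
    return [Section(s, e, t) for s, e, t in zip(boundaries, boundaries[1:], themes)]


def director(sections, fast_themes=("action",)):
    """Per-section intent. The 1.0 s / 3.0 s pacing values are made up."""
    return [Intent(sec, 1.0 if sec.theme in fast_themes else 3.0) for sec in sections]


def editor(intents, shots):
    """Greedy fill: per section, place unused shots whose tag matches the
    theme, trimmed to the pacing hint and to the time left in the section."""
    used, timeline = set(), []
    for intent in intents:
        t = intent.section.start
        for shot in shots:
            if t >= intent.section.end:
                break
            if shot.shot_id in used or shot.tag != intent.section.theme:
                continue
            length = min(shot.duration, intent.max_shot_len, intent.section.end - t)
            timeline.append((shot.shot_id, t, t + length))
            used.add(shot.shot_id)
            t += length
    return timeline


def make_mashup(boundaries, themes, shots):
    """Cascade the three levels: structure, then intent, then shot placement."""
    return editor(director(screenwriter(boundaries, themes)), shots)
```

Calling `make_mashup([0.0, 4.0, 6.0], ["calm", "action"], shots)` returns a list of `(shot_id, start, end)` tuples. The point of the layering is that each level only consumes the output of the level above it, so a richer planner could replace any one function without touching the others.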

3 Citations

0 Influential

27.9657359028 Altmetric

142.8 Score
