2604.17656v1 Apr 19, 2026 cs.SD

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

Sreyan Ghosh (citations: 49, h-index: 4)
Lie Lu (citations: 19, h-index: 3)
Dinesh Manocha (citations: 28, h-index: 3)
R. Duraiswami (citations: 12,044, h-index: 57)
Vaibhavi Lokegaonkar (citations: 41, h-index: 3)
Aryan Vijay Bhosale (citations: 0, h-index: 0)
Vishnu Raj (citations: 15, h-index: 2)
K. Gouthaman (citations: 3, h-index: 1)

Original Abstract

Video-to-music (V2M) generation is the fundamental task of creating background music for an input video. Recent V2M models typically achieve audiovisual alignment by relying on visual conditioning alone, which gives the end user limited semantic and stylistic control. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs to produce high-level music latents. These latents are subsequently refined into coherent, high-fidelity music using local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin enables fine-grained creator control without sacrificing audio realism. Our proposed model outperforms both video-only baselines and baselines conditioned on additional features on in-distribution and out-of-distribution benchmarks, with 2.21x faster inference than the state of the art (SOTA). We will open-source everything upon paper acceptance.
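The two-stage idea in the abstract (a global autoregressive plan followed by local diffusion-style refinement) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mixing weights, the latent size, and the hand-written "denoiser" that simply steps toward the planned latent are all assumptions made for illustration. A real system would use learned networks in both stages.

```python
import random

def plan_latents(video_feats, text_feat, dim=4):
    """Toy autoregressive planner (hypothetical): one coarse latent per video segment.

    Each latent mixes the previous latent with the segment's visual feature
    and the text feature, so later segments depend on earlier ones.
    """
    latents = []
    prev = [0.0] * dim
    for v in video_feats:
        cur = [0.5 * prev[i] + 0.25 * v[i] + 0.25 * text_feat[i] for i in range(dim)]
        latents.append(cur)
        prev = cur
    return latents

def refine_segment(plan, steps=8, frames=3, seed=0):
    """Toy local refinement (hypothetical stand-in for a diffusion model).

    Starts from Gaussian noise and, at each step, moves every frame toward
    the planned latent. A learned denoiser would predict this direction.
    """
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in plan] for _ in range(frames)]
    for t in range(steps):
        alpha = 1.0 / (steps - t)  # step size grows to 1.0 on the last step
        x = [[xi + alpha * (pi - xi) for xi, pi in zip(frame, plan)] for frame in x]
    return x

def generate(video_feats, text_feat):
    """Plan globally first, then refine each segment locally and independently."""
    plans = plan_latents(video_feats, text_feat)
    return [refine_segment(z, seed=k) for k, z in enumerate(plans)]

if __name__ == "__main__":
    video = [[1.0] * 4, [2.0] * 4]   # two segments of made-up visual features
    text = [1.0] * 4                 # made-up text-prompt feature
    music = generate(video, text)
    print(len(music), "segments,", len(music[0]), "frames each")
```

The point of the split is that only the short planning loop is sequential; each segment's refinement depends on its own planned latent and could in principle run in parallel. Whether this is the source of the reported speedup is not stated in the abstract.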

0 Citations

0 Influential

28.5 Altmetric

142.5 Score
