2601.19606v1 Jan 27, 2026 cs.CV

GMS-CAVP: Improving Audio-Video Correspondence through Multi-Scale Contrastive Learning and Generative Pretraining

GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

Shentong Mo

Citations: 1,710

h-index: 22

Zehua Chen

Citations: 157

h-index: 7

Jun Zhu

Citations: 32

h-index: 4

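Joint video-audio (V-A) embeddings have come to play an important role in recent V-A understanding and generation, serving as the foundation for tasks such as cross-modal retrieval and generation. Existing methods such as CAVP use contrastive learning to model the semantic and temporal correspondences between modalities effectively, but their performance remains suboptimal. A key limitation is that they do not sufficiently model the dense, multi-scale nature of video and audio signals. These correspondences often span spatial-temporal structures from fine to coarse granularity, yet existing frameworks underutilize them. To address this, the paper proposes GMS-CAVP, a new framework that improves V-A correspondence modeling by combining multi-scale V-A alignment with a multi-scale spatial-temporal diffusion-based pretraining objective. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures relations at varying levels of granularity. Second, it goes beyond conventional contrastive learning by integrating a diffusion-based generative objective, which enables modality translation and synthesis between audio and video. This unified discriminative-generative framework enables deeper cross-modal understanding and lays the groundwork for high-quality generation. In extensive experiments on the VGGSound, AudioSet, and Panda70M datasets, GMS-CAVP outperformed existing methods in both generation and retrieval.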

Original Abstract

Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in generation and retrieval.
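
The abstract gives no implementation details, so the following is only an illustrative sketch of what a multi-scale contrastive objective over paired video and audio features could look like. It is not the authors' method. Everything here is an assumption made for illustration: the function names (`temporal_pool`, `info_nce`, `multi_scale_contrastive_loss`), the use of average pooling over non-overlapping temporal windows to form scales, the window sizes, the temperature, and the choice of a symmetric InfoNCE loss averaged across scales.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def temporal_pool(feats, window):
    """Average-pool (batch, time, dim) features over non-overlapping windows."""
    b, t, d = feats.shape
    t_trim = (t // window) * window  # drop trailing frames that do not fill a window
    return feats[:, :t_trim].reshape(b, t_trim // window, window, d).mean(axis=2)


def log_softmax(logits, axis):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(video, audio, temperature=0.07):
    """Symmetric InfoNCE over paired (n, dim) embeddings; row i of each is a positive pair."""
    v = l2_normalize(video)
    a = l2_normalize(audio)
    logits = v @ a.T / temperature
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> audio
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)


def multi_scale_contrastive_loss(video, audio, windows=(1, 2, 4)):
    """Average the InfoNCE loss over several temporal scales (hypothetical choice)."""
    losses = []
    for w in windows:
        v = temporal_pool(video, w)
        a = temporal_pool(audio, w)
        # Treat every (clip, segment) as one sample: the audio segment at the
        # same clip and time index is the positive, all others are negatives.
        losses.append(info_nce(v.reshape(-1, v.shape[-1]), a.reshape(-1, a.shape[-1])))
    return float(np.mean(losses))
```

Under these assumptions, a window of 1 compares individual time steps (fine-grained), while larger windows compare pooled segments (coarse-grained). Calling `multi_scale_contrastive_loss(video, audio)` on two arrays of shape `(batch, time, dim)` returns a single scalar. The diffusion-based generative objective mentioned in the abstract is not sketched here.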

0 Citations

0 Influential

11 Altmetric

55.0 Score
