2601.11995v1 Jan 17, 2026 cs.MM

Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs

Donghuo Zeng

Hao Niu

Yanan Wang

Masato Taya

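Learning robust audio-visual embeddings requires binding genuinely related audio and visual signals together while filtering out incidental co-occurrences (background noise, unrelated elements, or unannotated events). Most contrastive-learning and triplet-loss methods rely on sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" may also contain the sound and appearance of a motorcycle, yet "motorcycle" is not among its annotations. Existing methods therefore treat such a clip as a negative for true "motorcycle" anchors elsewhere, which creates false negatives and misses real cross-modal dependencies. We propose a framework that uses soft-label predictions and inferred latent interactions to address these problems.

(1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce soft-label distributions that are aligned across modalities. It assigns nonzero probability to co-occurring but unannotated events, which enriches the supervision signal.

(2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to the teacher network's soft labels to infer a sparse, directed dependency graph among classes. The graph highlights directional dependencies (e.g., "Train (visual)" -> "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes. These relationships are interpreted as estimated dependency patterns.

(3) Latent Interaction Regularizer (LIR): a student network is trained with a metric loss together with a regularizer guided by the ILI graph. The regularizer pulls the embeddings of dependency-linked but unlabeled pairs closer together, with a strength proportional to the soft-label probabilities of the pair.

Experiments on the AVE and VEGAS benchmarks show that integrating inferred latent interactions into embedding learning consistently improves mean average precision (mAP), indicating more robust and semantically coherent embeddings.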
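The abstract does not give the formula for AV-SAL, so the following is only a minimal sketch of one plausible form: a supervised cross-entropy term per modality plus a symmetric KL term that encourages the audio and visual soft-label distributions to agree. The function names, the `weight` parameter, and the choice of symmetric KL are all assumptions made for illustration, not the authors' definition.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn class logits into a soft-label distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-probability of the annotated class."""
    return -math.log(max(probs[label], 1e-12))


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q), with q clamped away from zero."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)


def alignment_loss(audio_logits: List[float], visual_logits: List[float],
                   label: int, weight: float = 1.0) -> float:
    """Hypothetical AV-SAL-style loss (an assumption, not the paper's formula).

    The supervised term fits the sparse annotation in each modality; the
    alignment term pulls the two soft-label distributions toward each other,
    so probability mass on unannotated classes is shared across modalities.
    """
    p_audio = softmax(audio_logits)
    p_visual = softmax(visual_logits)
    supervised = cross_entropy(p_audio, label) + cross_entropy(p_visual, label)
    align = 0.5 * (kl_divergence(p_audio, p_visual) + kl_divergence(p_visual, p_audio))
    return supervised + weight * align
```

When both modalities produce identical logits, the alignment term vanishes and only the supervised term remains; disagreement between the modalities increases the loss in proportion to `weight`.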

Original Abstract

Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences: background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals even though "motorcycle" is not the chosen annotation; standard methods treat such clips as negatives for true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" -> "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): A student network is trained with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (mAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
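To make the idea behind the Latent Interaction Regularizer concrete, here is a rough sketch under stated assumptions. The abstract only says that dependency-linked but unlabeled pairs are pulled together in proportion to their soft-label probabilities; the specific weighting (product of the visual probability of the source class and the audio probability of the target class), the squared Euclidean distance, and every name below are hypothetical choices for illustration. The edge list stands in for the output of the graph-inference step, which is not reproduced here.

```python
from typing import List, Sequence, Tuple

# Assumed edge format: (visual class index, audio class index),
# e.g. "Train (visual)" -> "Motorcycle (audio)".
Edge = Tuple[int, int]


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pair_weight(p_visual: Sequence[float], p_audio: Sequence[float],
                edges: Sequence[Edge]) -> float:
    """Soft-label mass that the dependency graph assigns to one pair."""
    return sum(p_visual[i] * p_audio[j] for i, j in edges)


def latent_interaction_regularizer(
    visual_emb: List[Sequence[float]], audio_emb: List[Sequence[float]],
    visual_soft: List[Sequence[float]], audio_soft: List[Sequence[float]],
    visual_labels: List[int], audio_labels: List[int],
    edges: Sequence[Edge],
) -> float:
    """Hypothetical LIR term: weighted mean distance over unlabeled pairs."""
    total, count = 0.0, 0
    for z_v, p_v, y_v in zip(visual_emb, visual_soft, visual_labels):
        for z_a, p_a, y_a in zip(audio_emb, audio_soft, audio_labels):
            if y_v == y_a:
                continue  # same annotation: left to the metric loss
            total += pair_weight(p_v, p_a, edges) * squared_distance(z_v, z_a)
            count += 1
    return total / count if count else 0.0
```

In this sketch, a pair with no connecting edge, or with negligible soft-label mass on the linked classes, contributes almost nothing, so only pairs that the graph marks as dependent are drawn together. The term would be added to the student's metric loss with some trade-off coefficient, which is likewise an assumption.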
