2602.23353v1 Feb 26, 2026 cs.LG

SOTAlign: 최적 수송을 이용한 단일 모드 시각 및 언어 모델의 반지도 학습 기반 정렬

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Simon Roschmann

Citations: 8

h-index: 2

Paul Krzakala

Citations: 10

h-index: 2

Sonia Mazelet

Citations: 21

h-index: 2

Quentin Bouniot

Citations: 111

h-index: 6

Zeynep Akata

Citations: 1,547

h-index: 21

플라톤 표현 가설은 서로 다른 모드에서 학습된 신경망이 세상의 공유된 통계적 모델로 수렴한다고 주장합니다. 최근 연구에서는 이 수렴을 활용하여 가벼운 정렬 레이어를 사용하여 사전 훈련된 시각 및 언어 모델을 정렬하지만, 일반적으로 대비 손실과 수백만 개의 쌍으로 이루어진 데이터에 의존합니다. 본 연구에서는 훨씬 적은 수준의 지도 학습만으로 의미 있는 정렬이 가능한지 질문합니다. 우리는 사전 훈련된 단일 모드 인코더를 소수의 이미지-텍스트 쌍과 대량의 쌍을 이루지 않는 데이터와 함께 정렬하는 반지도 학습 환경을 소개합니다. 이 문제를 해결하기 위해, 우리는 SOTAlign이라는 2단계 프레임워크를 제안합니다. SOTAlign은 먼저 선형 지도 모델을 사용하여 제한된 쌍으로 이루어진 데이터로부터 대략적인 공유 기하 구조를 복구한 다음, 최적 수송 기반의 발산법을 사용하여 쌍을 이루지 않는 샘플에 대한 정렬을 개선합니다. 이 방법은 관계 구조를 전송하면서 대상 공간을 과도하게 제약하지 않습니다. 기존의 반지도 학습 방법과 달리, SOTAlign은 쌍을 이루지 않는 이미지와 텍스트를 효과적으로 활용하여 데이터 세트 및 인코더 쌍에 걸쳐 강력한 통합 임베딩을 학습하며, 지도 학습 및 반지도 학습 기반의 기존 방법보다 훨씬 뛰어난 성능을 보입니다.

Original Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

0 Citations

0 Influential

10.5 Altmetric

52.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!