2601.03666v2 Jan 07, 2026 cs.CL

e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

Zhicheng Dou (citations: 439; h-index: 13)

Haonan Chen (citations: 215; h-index: 7)

Sicheng Gao (citations: 15; h-index: 2)

R. Timofte (citations: 55,281; h-index: 101)

Tetsuya Sakai (citations: 54; h-index: 3)

Original Abstract

Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
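The abstract does not spell out how modality-aware temperature calibration is formulated, so the following is only a minimal sketch of one plausible reading of component (1): an InfoNCE-style contrastive loss in which each logit is divided by a temperature chosen per (query modality, candidate modality) pair. The table `TEMPERATURES`, its values, `DEFAULT_TEMPERATURE`, and the function names are all hypothetical and are not taken from the paper or the released checkpoint.

```python
import math

# Hypothetical per-modality-pair temperatures (illustrative values only,
# not taken from the paper). Smaller tau gives sharper logits.
TEMPERATURES = {
    ("text", "image"): 0.05,
    ("text", "video"): 0.07,
    ("text", "audio"): 0.10,
}
DEFAULT_TEMPERATURE = 0.05  # assumed fallback for unlisted pairs


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, query_modality, candidates, positive_index):
    """InfoNCE loss for one query over a mixed-modality candidate list.

    candidates is a list of (embedding, modality) pairs. Each logit is
    scaled by the temperature of its (query, candidate) modality pair,
    so that similarity scores from different modalities are pushed
    toward a comparable scale.
    """
    logits = []
    for embedding, modality in candidates:
        tau = TEMPERATURES.get((query_modality, modality), DEFAULT_TEMPERATURE)
        logits.append(cosine(query, embedding) / tau)

    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[positive_index]


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = [([1.0, 0.0], "image"), ([0.0, 1.0], "audio")]
    print(info_nce(query, "text", candidates, positive_index=0))
```

In this sketch the temperatures are fixed constants; a trainable variant would treat them as parameters learned jointly with the encoder. The negative curriculum with debiasing and the batch whitening with covariance regularization described in the abstract are not covered here.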
