2606.09331v1 Jun 08, 2026 cs.MM

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Zhiyuan Hu
Shiyu Li
Yifan Wang
Peiming Li
Zhengxuan Wei
Yang Tang

Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult because these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and then fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but that it also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone rather than the fused one, causing a large audio retrieval regression even though all audio-specific modules are copied unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector with the backbone frozen), followed by balanced multi-modal rehearsal. The resulting model supports all of these retrieval pathways in one backbone, scoring 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
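
As a rough illustration of fusing task vectors into one backbone, the sketch below uses plain task arithmetic: each specialist's task vector is taken to be its parameters minus the shared base parameters, and the fused backbone is the base plus a weighted sum of those vectors. The abstract does not state the exact merging rule, so this rule, the per-specialist weights, and every name here (`task_vector`, `fuse`, `layer.weight`) are assumptions made for illustration, not the paper's implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat values.
Params = Dict[str, List[float]]


def task_vector(base: Params, specialist: Params) -> Params:
    """Per-parameter difference (specialist - base)."""
    return {
        name: [s - b for s, b in zip(specialist[name], base[name])]
        for name in base
    }


def fuse(base: Params, specialists: List[Params], weights: List[float]) -> Params:
    """Add a weighted sum of task vectors to a copy of the base parameters.

    Assumed merging rule (simple task arithmetic); the paper may use another.
    """
    fused = {name: list(values) for name, values in base.items()}
    for specialist, weight in zip(specialists, weights):
        delta = task_vector(base, specialist)
        for name in fused:
            fused[name] = [f + weight * d for f, d in zip(fused[name], delta[name])]
    return fused


# Toy example: two specialists that each changed a different coordinate.
base = {"layer.weight": [0.0, 1.0, 2.0]}
image_specialist = {"layer.weight": [0.5, 1.0, 2.0]}
video_specialist = {"layer.weight": [0.0, 1.5, 2.0]}

fused = fuse(base, [image_specialist, video_specialist], [1.0, 1.0])
print(fused)
```

In this toy case the two specialists moved disjoint coordinates, so the fused parameters carry both changes. Note that a module trained against only one specialist's backbone (such as the audio projector described above) would not be updated by this step, which is the situation Projector Recovery is meant to address.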
