2602.07125v1 Feb 06, 2026 cs.IR

Reasoning-Augmented Representations for Multimodal Retrieval

Sukanta Ganguly

Citations: 2

h-index: 1

Jianrui Zhang

University of Wisconsin-Madison

Citations: 249

h-index: 5

A. Rajan

Citations: 70

h-index: 3

Brandon Han

Citations: 18

h-index: 2

Soochahn Lee

Citations: 23

h-index: 3

Y. Lee

Citations: 43

h-index: 2

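Universal Multimodal Retrieval (UMR) aims at any-to-any search across text and images, but current embedding models tend to be brittle when a query requires latent reasoning (for example, resolving an underspecified reference or satisfying compositional constraints). We argue that this brittleness is often data-induced: when the evidence in an image is "silent" and the query leaves its key semantics implicit, a single embedding pass has to reason and compress at the same time, which encourages spurious feature matching. To address this, we propose a data-centric framework that externalizes reasoning before retrieval. Using a strong Vision-Language Model, we add dense captions of the visual evidence in corpus entries, resolve ambiguous multimodal references in queries, and rewrite verbose instructions into concise retrieval constraints. Enhancement at inference time alone is not sufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and make full use of the added signal. In experiments on M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines. Ablations show that corpus enhancement chiefly benefits knowledge-intensive queries, while query enhancement is critical for compositional modification requests. Our code is publicly available at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.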

Original Abstract

Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision-Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.
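
The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the `Item` type, the `VLM` callable signature, the prompt wording, and the `[caption]` separator are all assumptions made for illustration, and the Vision-Language Model is left as a pluggable function. The sketch only shows where reasoning is externalized: corpus entries get a dense caption, queries are rewritten into a concise constraint, and the same enhancement is applied to the training pairs so the retriever is trained on the text it will see at inference time.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical VLM interface (an assumption, not a real library API):
# takes a prompt and an optional image reference, returns generated text.
VLM = Callable[[str, Optional[str]], str]


@dataclass
class Item:
    """A query or corpus entry; either field may be empty."""
    text: str = ""
    image: Optional[str] = None  # path or URL; None for text-only items


def enhance_corpus_entry(entry: Item, vlm: VLM) -> Item:
    """Append a dense caption so evidence visible only in the image becomes text."""
    if entry.image is None:
        return entry  # nothing "silent" to describe
    caption = vlm("Describe the visual evidence in this image in detail.", entry.image)
    return Item(text=f"{entry.text} [caption] {caption}".strip(), image=entry.image)


def enhance_query(query: Item, vlm: VLM) -> Item:
    """Resolve ambiguous references and compress the instruction into a constraint."""
    prompt = (
        "Resolve any ambiguous references and rewrite this request "
        "as a concise retrieval constraint: " + query.text
    )
    return Item(text=vlm(prompt, query.image), image=query.image)


def build_training_pairs(
    pairs: List[Tuple[Item, Item]], vlm: VLM
) -> List[Tuple[Item, Item]]:
    """Apply the same enhancement to (query, positive) training pairs."""
    return [(enhance_query(q, vlm), enhance_corpus_entry(d, vlm)) for q, d in pairs]
```

Any captioning-capable model could be wrapped to match the assumed `VLM` signature; the retriever itself (encoder, loss, negatives) is outside the scope of this sketch.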

