2603.01416v1 Mar 02, 2026 cs.AI

Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Zhixiang Wang (Citations: 1, h-index: 1)
Dajun Chen (Citations: 121, h-index: 4)
Wei Jiang (Citations: 23, h-index: 4)
Jing Xu (Citations: 6, h-index: 1)
Yunfang Wu (Citations: 21, h-index: 2)
Yong Li (Citations: 60, h-index: 4)

Original Abstract

Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
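
The abstract does not spell out how OBM scores or combines parameters, so the following is only a rough illustration of saliency-aware merging in general, not the authors' algorithm. It assumes that the text-based search agent and the VLM's language weights were both fine-tuned from a shared base model, and that saliency is a diagonal second-order estimate (0.5 × mean squared gradient × squared update) computed from a few calibration samples. The function names, the `keep_ratio` parameter, and the toy weights are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned model and its base."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def saliency(delta: List[float], fisher: List[float]) -> List[float]:
    """Estimated loss increase if an update is dropped (assumed criterion).

    Uses a diagonal curvature approximation, 0.5 * F_i * delta_i^2, where F_i
    stands for the mean squared gradient over the calibration samples.
    """
    return [0.5 * f * d * d for d, f in zip(delta, fisher)]


def keep_salient(delta: List[float], scores: List[float],
                 keep_ratio: float) -> List[float]:
    """Zero out every update whose score falls below the top-k threshold."""
    k = max(1, round(keep_ratio * len(delta)))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [d if s >= threshold else 0.0 for d, s in zip(delta, scores)]


def merge(base: Weights, agent: Weights, vlm: Weights,
          fisher_agent: Weights, fisher_vlm: Weights,
          keep_ratio: float = 0.2) -> Weights:
    """Add the salient parts of both task vectors back onto the shared base."""
    agent_tv = task_vector(agent, base)
    vlm_tv = task_vector(vlm, base)
    merged: Weights = {}
    for name in base:
        a = keep_salient(agent_tv[name],
                         saliency(agent_tv[name], fisher_agent[name]),
                         keep_ratio)
        v = keep_salient(vlm_tv[name],
                         saliency(vlm_tv[name], fisher_vlm[name]),
                         keep_ratio)
        merged[name] = [b + x + y for b, x, y in zip(base[name], a, v)]
    return merged


# Toy example: one 4-element parameter, uniform curvature estimates.
base = {"w": [0.0, 0.0, 0.0, 0.0]}
agent = {"w": [1.0, 0.1, 0.0, 0.0]}   # stands in for the text search agent
vlm = {"w": [0.0, 0.0, 2.0, 0.2]}     # stands in for the VLM language weights
ones = {"w": [1.0, 1.0, 1.0, 1.0]}
merged = merge(base, agent, vlm, ones, ones, keep_ratio=0.25)
print(merged)
```

In this toy setup each model keeps only its single most salient update, so the small, low-impact changes that might interfere with the other model are discarded before the two task vectors are summed.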

Citations: 0, Influential: 0, Altmetric: 2, Score: 10.0
