2603.16289v1 Mar 17, 2026 cs.CV

VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

Authors:

Kevin I-Kai Wang (citations: 0, h-index: 0)
Jinbo Su (citations: 20, h-index: 3)
Yifei Zhang, Emory University (citations: 199, h-index: 7)
Zhaowen Zhou (citations: 16, h-index: 3)
Changtao Miao (citations: 27, h-index: 3)
Yu Hong (citations: 6, h-index: 2)
Qimeng Wu (citations: 0, h-index: 0)
Yumeng Liu (citations: 42, h-index: 4)
Fei Wu (citations: 8, h-index: 2)
Yihe Tian (citations: 2, h-index: 1)
Yuhao Liang (citations: 21, h-index: 1)
Zitong Shan (citations: 51, h-index: 2)
Wanke Xia (citations: 12, h-index: 1)
Bo Zhang (citations: 5, h-index: 1)
Zhe Li (citations: 11, h-index: 2)
Shiming Xiang (citations: 91, h-index: 6)
Ying Yan (citations: 54, h-index: 4)

Original Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: they insufficiently evaluate visual reasoning ability, and they neglect the native visual information of web pages in their reasoning chains. To address these limitations, we introduce VisBrowse-Bench, a new benchmark for visual-native search. It contains 169 VQA instances covering multiple domains and evaluates models' visual reasoning capabilities during the search process through multimodal evidence cross-validation, using text-image retrieval and joint reasoning. The data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluate both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, achieves an accuracy of only 47.6%, while the proprietary Deep Research model o3-deep-research achieves only 41.1%. The code and data are available at: https://github.com/ZhengboZhang/VisBrowse-Bench

