2604.25122v1 Apr 28, 2026 cs.CV

M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

Jiatong Ma

Citations: 39

h-index: 2

Longteng Guo

Citations: 1,547

h-index: 17

Yuchen Liu

Citations: 63

h-index: 3

Dongze Hao

Citations: 91

h-index: 4

Xuanxu Lin

Citations: 13

h-index: 2

Zijia Zhao

Citations: 536

h-index: 8

Jing Liu

Citations: 239

h-index: 7

Original Abstract

We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning-aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M$^3$-VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at https://github.com/CASIA-IVA-Lab/M3VQA.
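The three evaluation settings named in the abstract (no external knowledge, gold evidence, retrieval-augmented input) can be illustrated with a minimal sketch of how the text side of a model input might be assembled. This is not the authors' code: the `Example` record, the `build_prompt` function, the setting names, and the `retriever` callable are all hypothetical, and image inputs are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Example:
    # Hypothetical record layout; the released dataset schema may differ.
    question: str
    gold_evidence: List[str] = field(default_factory=list)


def build_prompt(
    example: Example,
    setting: str,
    retriever: Optional[Callable[[str], List[str]]] = None,
) -> str:
    """Assemble the text prompt for one of three illustrative settings."""
    if setting == "no_knowledge":
        # The model must answer from its own parametric knowledge.
        context: List[str] = []
    elif setting == "gold_evidence":
        # The annotated supporting passages are given directly.
        context = example.gold_evidence
    elif setting == "retrieval":
        # Passages come from a retriever (assumed interface: query -> passages).
        if retriever is None:
            raise ValueError("the retrieval setting needs a retriever")
        context = retriever(example.question)
    else:
        raise ValueError(f"unknown setting: {setting}")

    lines = [f"Evidence {i + 1}: {passage}" for i, passage in enumerate(context)]
    lines.append(f"Question: {example.question}")
    return "\n".join(lines)
```

Comparing a model's answers across the three prompts for the same question separates what it knows unaided from what it can do when given correct or retrieved evidence, which is the contrast the abstract reports.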
