2603.02024v1 Mar 02, 2026 cs.CL

MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Pengfei Cao (Institute of Automation, Chinese Academy of Sciences). Citations: 1,719; h-index: 21

Kang Liu. Citations: 4,591; h-index: 29

Zhuoran Jin. Citations: 607; h-index: 13

Jiachun Li. Citations: 172; h-index: 6

Shao-Gang Huang. Citations: 3; h-index: 1

Chenlong Zhang. Citations: 18; h-index: 3

Yubo Chen (Institute of Automation, Chinese Academy of Sciences). Citations: 4,818; h-index: 27

Jun Zhao. Citations: 35; h-index: 3

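Recent advances in the reasoning capabilities of multimodal large language models (MLLMs) have enabled them to handle complex tasks such as scientific analysis and mathematical reasoning. However, their reasoning abilities in real-life settings remain largely unexplored, and standardized benchmarks for evaluating them are lacking. To close this gap, we introduce MMR-Life, a comprehensive benchmark for evaluating the diverse multimodal, multi-image reasoning capabilities of MLLMs in real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images collected primarily from real-world settings, and covers seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not depend on domain-specific expertise; instead, it requires models to integrate information across multiple images and apply a range of reasoning abilities. An evaluation of 37 advanced models shows that MMR-Life is substantially challenging. Even top-performing models such as GPT-5 reach only 58% accuracy, and performance varies widely across reasoning types. We also analyze the reasoning paradigms of existing MLLMs, examining how factors such as thinking length, reasoning method, and reasoning type affect performance. In summary, MMR-Life provides a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.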

Original Abstract

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
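The abstract reports an overall accuracy together with per-reasoning-type variation on multiple-choice questions. As a minimal sketch of how such scores could be tallied (this is not the authors' evaluation code; the record keys `type`, `answer`, and `prediction` and the toy data are assumptions made for illustration):

```python
from collections import defaultdict

# The seven reasoning types named in the abstract.
REASONING_TYPES = (
    "abductive", "analogical", "causal", "deductive",
    "inductive", "spatial", "temporal",
)


def accuracy_by_type(records):
    """Return (per-type accuracy dict, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'type' (a reasoning type), 'answer' (gold option letter) and
    'prediction' (the model's option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["type"]] += 1
        correct[record["type"]] += int(record["prediction"] == record["answer"])
    per_type = {t: correct[t] / total[t] for t in total}
    n_questions = sum(total.values())
    # Guard against an empty input rather than dividing by zero.
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return per_type, overall


# Toy records, invented for this example only.
toy = [
    {"type": "causal", "answer": "A", "prediction": "A"},
    {"type": "causal", "answer": "B", "prediction": "C"},
    {"type": "spatial", "answer": "D", "prediction": "D"},
]
per_type, overall = accuracy_by_type(toy)
```

Reporting the per-type dictionary alongside the overall figure is what would expose the variance across reasoning types that the abstract describes; a single aggregate number would hide it.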
