2602.15918v1 Feb 17, 2026 cs.CV

EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

Zelin Xu

Yupu Zhang

Saugat Adhikari

S. Islam

Tingsong Xiao

Zibo Liu

Shigang Chen

Da Yan

Zhejun Jiang

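Benchmarking the spatial reasoning capabilities of multimodal large language models (MLLMs) is attracting growing interest in computer vision because of its importance for embodied AI and autonomous agent systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, because it involves grounding objects in georeferenced images and reasoning quantitatively about distances, directions, and topological relations using visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery focus mainly on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To close this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating the spatial reasoning capabilities of MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs and covers: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references given as textual descriptions, visual overlays, and explicit geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). We ran extensive experiments on both open-source and proprietary models to identify the limits of MLLM spatial reasoning.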

Original Abstract

Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
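
To make the three kinds of reasoning named in the abstract concrete (distance, direction, topology), here is a minimal Python sketch for two objects given as 2D bounding boxes. It is an illustration only, not the benchmark's own code or scoring. The function names, the `(xmin, ymin, xmax, ymax)` pixel box format, the north-up image orientation, and the `metres_per_pixel` ground resolution are all assumptions introduced for this example.

```python
import math

# Hypothetical helpers; boxes are (xmin, ymin, xmax, ymax) in pixel coordinates.

def center(box):
    """Center (x, y) of an axis-aligned box."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)

def distance_m(box_a, box_b, metres_per_pixel):
    """Center-to-center distance in metres, assuming a uniform ground resolution."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(bx - ax, by - ay) * metres_per_pixel

def bearing_deg(box_a, box_b):
    """Compass bearing from A to B (0 = north, 90 = east).

    Assumes a north-up image whose y axis grows downward, i.e. toward the south.
    """
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    east, north = bx - ax, ay - by
    return math.degrees(math.atan2(east, north)) % 360.0

def topology(box_a, box_b):
    """Coarse topological relation of A with respect to B (non-degenerate boxes)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Extent of the intersection along each axis; negative means a gap.
    w = min(ax1, bx1) - max(ax0, bx0)
    h = min(ay1, by1) - max(ay0, by0)
    if w < 0 or h < 0:
        return "disjoint"
    if w == 0 or h == 0:
        return "touches"  # boundaries meet, interiors do not
    if tuple(box_a) == tuple(box_b):
        return "equal"
    if ax0 <= bx0 and ay0 <= by0 and ax1 >= bx1 and ay1 >= by1:
        return "contains"
    if bx0 <= ax0 and by0 <= ay0 and bx1 >= ax1 and by1 >= ay1:
        return "within"
    return "overlaps"

if __name__ == "__main__":
    building = (100, 100, 140, 140)
    tank = (200, 100, 240, 140)
    print(distance_m(building, tank, metres_per_pixel=0.5))
    print(bearing_deg(building, tank))
    print(topology(building, tank))
```

Polylines and polygons, which the abstract also lists, need general geometry predicates rather than these box shortcuts; a geometry library would be the usual choice there.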
