2606.09669v1 Jun 08, 2026 cs.AI

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongyi Yuan
Zihao Huang
Nan Duan
Bohan Zeng
Wenjie Li
Wentao Zhang
Bo Wang
Haoyang Huang
Hongcheng Gao
Hailong Qu
Jingyi Tang
Jiahao Wang
Hengkang Qiao
Shihong Huang
Junming Yang
Yi Li
Wenbo Li
Jianhui Liu
Oliver Huang
Guo-Ting Huang
Yinpeng Dong

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
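The evaluation protocol described in the abstract (observe, emit a text action, check the terminal state, aggregate into TSR) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Backend` interface, the `Task` fields, the `ToyCorridor` environment, the action strings, and the step budget are all hypothetical names chosen for illustration, and a text string stands in for the egocentric image an agent would actually receive.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical simulator-agnostic interface (an assumption, not from the paper)."""

    def reset(self) -> str: ...
    def step(self, action: str) -> str: ...
    def state(self) -> Dict[str, int]: ...


class ToyCorridor:
    """Toy 1-D backend; the observation string stands in for an egocentric view."""

    def __init__(self) -> None:
        self.pos = 0

    def reset(self) -> str:
        self.pos = 0
        return f"pos={self.pos}"

    def step(self, action: str) -> str:
        # Only one text action changes the world; anything else is a no-op.
        if action == "move_forward":
            self.pos += 1
        return f"pos={self.pos}"

    def state(self) -> Dict[str, int]:
        return {"pos": self.pos}


@dataclass
class Task:
    backend: Backend
    verifier: Callable[[Dict[str, int]], bool]  # terminal-state check
    max_steps: int = 20                         # assumed step budget


def run_episode(task: Task, agent: Callable[[str], str]) -> bool:
    """Run one episode and judge success on the terminal state only."""
    obs = task.backend.reset()
    for _ in range(task.max_steps):
        action = agent(obs)
        if action == "stop":
            break
        obs = task.backend.step(action)
    return task.verifier(task.backend.state())


def task_success_rate(outcomes: List[bool]) -> float:
    """TSR as the percentage of tasks whose verifier passed."""
    if not outcomes:
        raise ValueError("no outcomes to aggregate")
    return 100.0 * sum(outcomes) / len(outcomes)


def make_task() -> Task:
    # Hypothetical goal: end the episode at position 3.
    return Task(backend=ToyCorridor(), verifier=lambda s: s["pos"] == 3)


def forward_agent(obs: str) -> str:
    return "stop" if obs == "pos=3" else "move_forward"


def idle_agent(obs: str) -> str:
    return "stop"
```

Because success is judged only by the verifier on the final state, an agent that reaches the goal by a long detour scores the same as one that follows the shortest path. This is one way a gap between task success and execution efficiency can arise, so a step count would need to be tracked separately to measure efficiency.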
