2606.09669v1 Jun 08, 2026 cs.AI

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongyi Yuan

Citations: 6,219

h-index: 19

Zihao Huang

Citations: 1,565

h-index: 6

Nan Duan

Citations: 45

h-index: 3

Bohan Zeng

Citations: 234

h-index: 9

Wenjie Li

Citations: 18

h-index: 2

Wentao Zhang

Citations: 38

h-index: 3

Bo Wang

Citations: 35

h-index: 2

Haoyang Huang

Citations: 397

h-index: 5

Hongcheng Gao

Citations: 83

h-index: 3

Hailong Qu

Citations: 31

h-index: 2

Jingyi Tang

Citations: 28

h-index: 2

Jiahao Wang

Citations: 19

h-index: 2

Hengkang Qiao

Citations: 0

h-index: 0

Shihong Huang

Citations: 24

h-index: 1

Junming Yang

Citations: 5

h-index: 2

Yi Li

Citations: 58

h-index: 4

Wenbo Li

Citations: 259

h-index: 2

Jianhui Liu

Citations: 89

h-index: 2

Oliver Huang

Citations: 40

h-index: 2

Guo-Ting Huang

Citations: 61

h-index: 1

Yinpeng Dong

Citations: 218

h-index: 5

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
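
As a rough illustration of the evaluation setup described above, the sketch below models a task as an initial state, a reference trajectory, and a terminal-state verifier, and computes a task success rate (TSR) as the fraction of tasks whose final state passes the verifier. The `Task` fields, the `task_success_rate` helper, and the toy state dictionaries are hypothetical names chosen for this example, not the benchmark's actual API. The abstract does not say how the reported average TSR is aggregated across domains, so this sketch assumes a plain per-task average.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, str]  # hypothetical: a flat description of the world state


@dataclass
class Task:
    """Hypothetical task record mirroring the three components in the abstract."""
    task_id: str
    domain: str
    initial_state: State
    reference_trajectory: List[str]      # text actions of a known-good solution
    verifier: Callable[[State], bool]    # checks the terminal state only


def task_success_rate(tasks: List[Task],
                      final_states: Dict[str, Optional[State]]) -> float:
    """Fraction of tasks whose terminal state passes the verifier.

    A task with no recorded final state counts as a failure.
    Assumption: simple per-task average, not a per-domain macro average.
    """
    if not tasks:
        return 0.0
    successes = 0
    for task in tasks:
        state = final_states.get(task.task_id)
        if state is not None and task.verifier(state):
            successes += 1
    return successes / len(tasks)


# Toy data: two tasks, one solved and one not.
demo_tasks = [
    Task("t1", "household", {"cup": "table"}, ["pick cup", "place cup sink"],
         lambda s: s.get("cup") == "sink"),
    Task("t2", "travel", {"agent": "lobby"}, ["go gate"],
         lambda s: s.get("agent") == "gate"),
]
demo_final_states = {"t1": {"cup": "sink"}, "t2": {"agent": "lobby"}}
demo_tsr = task_success_rate(demo_tasks, demo_final_states)
```

Because the verifier inspects only the terminal state, an agent can succeed with a trajectory that differs from the reference one. That is one way task success and execution efficiency can diverge.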

0 Citations

0 Influential

9.5 Altmetric

47.5 Score
