2604.07973v1 Apr 09, 2026 cs.AI

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

Ziyou Wang (Citations: 69, h-index: 3)
Baining Zhao (Citations: 145, h-index: 6)
Jianjie Fang (Citations: 108, h-index: 5)
Yanggang Xu (Citations: 58, h-index: 4)
Yatai Ji (Citations: 38, h-index: 3)
Jiachen Xu (Citations: 7, h-index: 1)
Qian Zhang (Citations: 277, h-index: 7)
Weichen Zhang (Citations: 103, h-index: 5)
Xinlei Chen (Citations: 169, h-index: 7)
Zile Zhou (Citations: 67, h-index: 3)
Chen Gao (Citations: 156, h-index: 6)

Abstract

Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve human-like embodied spatial action through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset of 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. We then comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. We investigate the limitations of LMMs by analyzing their behavior at these critical decision bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.
