2601.14339v1 Jan 20, 2026 cs.CV

CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments

Zhengqiu Zhu

Citations: 6

h-index: 1

Haotian Xu

Citations: 19

h-index: 3

Yue Hu

Citations: 42

h-index: 3

Chen Gao

Citations: 50

h-index: 3

Ziyou Wang

Citations: 69

h-index: 3

J. Rao

Citations: 356

h-index: 7

Wenhao Lu

Citations: 4

h-index: 1

Weishi Li

Citations: 1

h-index: 1

Quanjun Yin

Citations: 148

h-index: 7

Yong Li

Citations: 149

h-index: 6

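Cross-view spatial reasoning is essential for spatial understanding, mental simulation, and planning in complex environments, and is central to embodied AI. Existing benchmarks focus mainly on indoor or street-level settings and overlook the distinctive challenges of open urban spaces, which are characterized by rich semantics, complex geometry, and varied viewpoints. To address this, the paper introduces CityCube, a systematic benchmark for evaluating the cross-view reasoning abilities of current vision-language models (VLMs) in urban environments. CityCube incorporates four viewpoint dynamics that mimic camera movement and covers a broad range of perspectives from platforms such as vehicles, drones, and satellites. For comprehensive evaluation, it contains 5,022 carefully annotated multi-view question-answer pairs, organized into five cognitive dimensions and three spatial relation expressions. An evaluation of 33 VLMs shows a substantial gap to human-level performance: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small fine-tuned VLMs reach over 60.0% accuracy, which underscores the need for the benchmark. Further analysis reveals correlations between tasks and a fundamental cognitive difference between VLMs and human-like reasoning.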

Original Abstract

Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs reveals a significant performance disparity with humans: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small-scale fine-tuned VLMs achieve over 60.0% accuracy, highlighting the necessity of our benchmark. Further analyses indicate the task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.
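
The abstract reports accuracy over QA pairs grouped into cognitive dimensions. As a rough illustration of how such grouped scoring might be computed, here is a minimal sketch. It assumes multiple-choice answers compared by option letter; the field names (`dimension`, `answer`, `prediction`), the group labels, and the toy records are all hypothetical and are not taken from CityCube or its evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for multiple-choice QA records.

    Each record is a dict with hypothetical fields: `key` (the grouping
    field, e.g. "dimension"), "answer" (gold option letter) and
    "prediction" (the model's option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records invented for illustration only (not CityCube data).
records = [
    {"dimension": "dim_1", "answer": "B", "prediction": "B"},
    {"dimension": "dim_1", "answer": "C", "prediction": "a"},
    {"dimension": "dim_2", "answer": "D", "prediction": "d "},
]
scores = accuracy_by_group(records, "dimension")
```

The same helper could group by any other field, such as a hypothetical spatial relation expression label, by passing a different `key`.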

2 Citations

0 Influential

3.5 Altmetric

19.5 Score
