2601.19099v1 Jan 27, 2026 cs.CV

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Igor Molybog

Citations: 31,610

h-index: 7

Yosub Shin

Citations: 1

h-index: 1

Michael Buriek

Citations: 0

h-index: 0

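Vision-language models (VLMs) perform well on a wide range of multimodal benchmarks, but they remain brittle on spatial reasoning tasks that require linking an abstract overhead map to street-view images captured from a first-person viewpoint. This work introduces m2sv, a scalable benchmark for map-to-street-view spatial reasoning. m2sv asks a model to infer the camera's viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with deliberately controlled ambiguity, and m2sv-sft-11k, a set of structured reasoning traces for supervised fine-tuning. Even the best VLM evaluated, despite strong results on existing multimodal benchmarks, reaches only 65.2% accuracy on m2sv, far below the human baseline of 95%. Supervised fine-tuning and reinforcement learning improve performance, but cross-benchmark evaluation shows that transfer is limited. Beyond aggregate accuracy, we systematically analyze what makes map-to-street-view reasoning difficult, using both structural signals and human effort, and carry out a failure analysis of adapted open models. The results point to persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, and they motivate future work on spatial reasoning across viewpoints.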

Original Abstract

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
