2602.08342v1 Feb 09, 2026 cs.CV

UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science

Jie Zhang

Xingtong Yu

Yuan Fang

Rudi Stouffs

Zdravko Trivic

Original Abstract

Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. Finally, we introduce UGBench, a comprehensive benchmark that evaluates how well spatially grounded embeddings support diverse urban understanding tasks, including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built on the Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and gains of over 30% and 22%, respectively, on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.
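
The abstract names contrastive learning as one ingredient of UGE but does not state the objective. As a rough illustration only, the sketch below computes a generic InfoNCE-style loss over a batch of paired embeddings, where row i of the queries is meant to match row i of the targets. The function names, the temperature value, and the choice of InfoNCE itself are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def info_nce(
    queries: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """Mean cross-entropy of picking target i for query i among all batch targets.

    Generic InfoNCE sketch (hypothetical helper, not the authors' code).
    """
    if not queries or len(queries) != len(targets):
        raise ValueError("queries and targets must be non-empty and equal in length")
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(q)
```

With well-aligned pairs the loss approaches zero, and when every target is identical it equals the log of the batch size, which is a quick sanity check for any implementation of this kind.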
