2601.05432v1 Jan 08, 2026 cs.CV

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Yong Wang (citations: 472, h-index: 12)

Xiangxiang Chu (citations: 278, h-index: 9)

Yuxiang Ji (citations: 126, h-index: 6)

Ziyu Ma (citations: 96, h-index: 4)

Yiming Hu (citations: 334, h-index: 4)

Hailang Huang (citations: 253, h-index: 6)

Xuecai Hu (citations: 17, h-index: 3)

Guanhua Chen (citations: 17, h-index: 2)

Liaoni Wu (citations: 116, h-index: 5)

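Image geolocalization is the task of predicting where on Earth an image was taken, using visual clues. Existing large vision-language model (LVLM) approaches draw on world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a strategy humans commonly use: consulting maps. In this work, we give the model a "Thinking with Map" ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme consisting of agentic reinforcement learning (RL) and parallel test-time scaling (TTS). RL strengthens the model's agentic capability and improves sampling efficiency, while parallel TTS lets the model explore multiple candidate paths before making its final prediction, which is crucial for geolocalization. To evaluate the method on up-to-date, in-the-wild images, we additionally present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed of real-world images. Experiments show that the method outperforms existing open- and closed-source models on most metrics; in particular, compared with Gemini-3-Pro in Google Search/Map grounded mode, it raises Acc@500m from 8.0% to 22.1%.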

Original Abstract

The image geolocalization task aims to predict where on Earth an image was taken, using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with a "Thinking with Map" ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, consisting of agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL stage strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS stage enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, in particular improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
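The abstract reports results as Acc@500m without defining it here. The sketch below assumes the usual reading of such metrics: the fraction of predictions whose great-circle distance to the ground-truth coordinates is at most the threshold. The function names `haversine_m` and `acc_at`, the haversine distance, and the mean Earth radius are illustrative assumptions, not details taken from the paper.

```python
import math

# Assumption: spherical Earth with the mean radius, in meters.
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def acc_at(preds, truths, threshold_m=500.0):
    """Fraction of predictions within threshold_m meters of the ground truth.

    preds and truths are equal-length sequences of (lat, lon) pairs in degrees.
    """
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(
        haversine_m(p[0], p[1], t[0], t[1]) <= threshold_m
        for p, t in zip(preds, truths)
    )
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    truths = [(0.0, 0.0), (0.0, 0.0), (10.0, 10.0)]
    preds = [(0.001, 0.0), (0.01, 0.0), (10.0, 10.002)]
    print(f"Acc@500m = {acc_at(preds, truths, 500.0):.3f}")
```

Under this reading, the thresholded accuracy is strict at fine scales: a prediction in the right city but more than 500 m away counts as a miss, which is why map-based verification of candidate locations could plausibly matter for this metric.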

8 Citations

0 Influential

6 Altmetric

38.0 Score
