2602.14201v2 Feb 15, 2026 cs.CV

GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Fengxiang Wang (citations: 124, h-index: 7)
Mingshuo Chen (citations: 84, h-index: 4)
Yueying Li (citations: 24, h-index: 1)
Yajie Yang (citations: 0, h-index: 0)
Di Wang (citations: 147, h-index: 7)
Yifan Zhang (citations: 292, h-index: 7)
Hongda Sun (citations: 0, h-index: 0)
Long Lan (citations: 143, h-index: 8)
Jun Song (citations: 373, h-index: 8)
Yulin Wang (citations: 73, h-index: 3)
Jing Zhang (citations: 323, h-index: 8)
Bo Du (citations: 59, h-index: 5)
Xue Yang (citations: 46, h-index: 3)

Original Abstract

The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.

Citations: 0 | Influential: 0 | Altmetric: 4 | Score: 20.0