2604.13888v1 Apr 15, 2026 cs.AI

GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

Bo Yu (Citations: 129, h-index: 5)
Dongyang Hou (Citations: 1, h-index: 1)
Zhimin Zhang (Citations: 4, h-index: 1)
Haifeng Li (Citations: 109, h-index: 2)
Jiayao Liu (Citations: 16, h-index: 2)
Wentao Yang (Citations: 1, h-index: 1)
Chen Yang (Citations: 2, h-index: 1)
Cheng Liu (Citations: 67, h-index: 3)
Chi Wang (Citations: 266, h-index: 6)

Original Abstract

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a "Last-Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
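The abstract names the "Last-Attempt Alignment" strategy behind PEA but does not define it. The sketch below is one possible reading, not the authors' implementation: it assumes that an agent may retry a tool several times, that only its final call to each tool is scored, and that PEA is the fraction of reference parameters the final call reproduces exactly. The tool names (`buffer`, `clip`), the parameter names, and both function names are hypothetical.

```python
from typing import Any

# A tool call is (tool name, parameters). This shape is an assumption for
# illustration; the paper's actual trace format is not given in the abstract.
ToolCall = tuple[str, dict[str, Any]]


def last_attempts(trace: list[ToolCall]) -> dict[str, dict[str, Any]]:
    """Keep only the final call the agent made to each tool."""
    final: dict[str, dict[str, Any]] = {}
    for tool, params in trace:
        final[tool] = params  # later calls overwrite earlier retries
    return final


def parameter_execution_accuracy(
    trace: list[ToolCall], reference: list[ToolCall]
) -> float:
    """Fraction of reference parameters matched by the agent's last attempts.

    Assumed scoring rule: exact equality per parameter, with a missing tool
    or missing parameter counted as a miss.
    """
    final = last_attempts(trace)
    total = 0
    matched = 0
    for tool, ref_params in reference:
        agent_params = final.get(tool, {})
        for name, value in ref_params.items():
            total += 1
            if name in agent_params and agent_params[name] == value:
                matched += 1
    return matched / total if total else 0.0


# Hypothetical run: the agent first buffers in the wrong unit, retries with
# the right one, then clips against the wrong mask layer.
trace = [
    ("buffer", {"distance": 100, "unit": "km"}),
    ("buffer", {"distance": 100, "unit": "m"}),
    ("clip", {"mask": "roads"}),
]
reference = [
    ("buffer", {"distance": 100, "unit": "m"}),
    ("clip", {"mask": "city_boundary"}),
]
score = parameter_execution_accuracy(trace, reference)
```

Under these assumptions the corrected `buffer` retry earns full credit and the wrong `clip` mask earns none, so the score reflects where the agent ended up rather than how many attempts it took.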
