2602.22638v1 Feb 26, 2026 cs.AI

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Kaikui Liu

Xiangxiang Chu

He Zhu

Jingshuai Zhang

Baidu Inc.

Chuan Qin

Chao Chen

Longfei Xu

Chao Wang

Zhiheng Song

Original Abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench .
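
The abstract mentions a deterministic API-replay sandbox without describing its design. As a rough illustration of the general idea only, and not the authors' implementation, the Python sketch below records a tool response once and serves the same response for any later call with identical arguments. The names `ReplaySandbox`, `request_key`, the `route` tool, and the response fields are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def request_key(tool: str, params: Dict[str, Any]) -> str:
    """Build a stable key from a tool name and its arguments (hypothetical helper)."""
    # sort_keys makes the key independent of the order in which arguments are given.
    canonical = json.dumps(
        {"tool": tool, "params": params}, sort_keys=True, ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplaySandbox:
    """Serves pre-recorded tool responses instead of calling a live service."""

    def __init__(self) -> None:
        self._recorded: Dict[str, Any] = {}

    def record(self, tool: str, params: Dict[str, Any], response: Any) -> None:
        """Store the response observed for one (tool, params) request."""
        self._recorded[request_key(tool, params)] = response

    def call(self, tool: str, params: Dict[str, Any]) -> Any:
        """Return the recorded response; never contact a live service."""
        key = request_key(tool, params)
        if key not in self._recorded:
            # Fail loudly rather than fall back to a non-deterministic live call.
            raise KeyError(f"no recorded response for {tool} with {params}")
        return self._recorded[key]


# Hypothetical usage: the tool name and fields are illustrative assumptions.
sandbox = ReplaySandbox()
sandbox.record(
    "route",
    {"origin": "A", "destination": "B", "mode": "driving"},
    {"distance_m": 1200, "duration_s": 300},
)
first = sandbox.call("route", {"mode": "driving", "destination": "B", "origin": "A"})
second = sandbox.call("route", {"origin": "A", "destination": "B", "mode": "driving"})
```

Because the lookup key is derived from canonicalized arguments, repeated evaluation runs see identical tool outputs regardless of argument order, which is the property that removes variance from live services. Raising an error on unrecorded requests is one possible design choice; a real benchmark may handle such cases differently.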
