2605.27134v1 May 26, 2026 cs.AI

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Jian Luan
Wei Liu
Hengxu Qu
Pengzhi Gao
Yike Liu
Renren Jin
Wenzong Zhang

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16,000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.
