2605.29486v1 May 28, 2026 cs.CL

PhoneWorld: Scaling Phone-Use Agent Environments

Junyi Li
Xingran Zhou
Y. Zhang
Yuxuan Liu
Renmin University of China
Zhengyang Tang
Yi Guo
X. Lai
Pengyuan Lyu
Chengquan Zhang
Benyou Wang
Fei Tang
Weinong Wang
Yang Ding
Hua Shen
Zhengyao Fang
Sunqi Fan
Shangpin Peng
Harbin Institute of Technology, Shenzhen
Zhenghao Ruan
An Zhang
Jason
Liangxuan Wu
Rui Yan
Jinan Wen
Han Hu

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
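
The abstract describes mock apps backed by read-only app content and mutable state, with rule-based verifiers derived from the same environments. The following is a minimal sketch of how that separation might look, not the authors' implementation: the `MockApp` class, its `tap_item` and `add_to_cart` actions, the `make_cart_verifier` helper, and the catalog entries are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass
class MockApp:
    """Hypothetical mock shopping app: fixed content plus mutable state."""
    content: Mapping[str, dict]  # read-only app content (never modified here)
    state: dict = field(default_factory=lambda: {"screen": "home", "cart": []})

    def tap_item(self, item_id: str) -> None:
        # Navigation changes only the mutable state, never the content.
        if item_id not in self.content:
            raise KeyError(f"unknown item: {item_id}")
        self.state["screen"] = f"detail:{item_id}"

    def add_to_cart(self) -> None:
        # This action is only valid on an item detail screen.
        screen = self.state["screen"]
        if not screen.startswith("detail:"):
            raise RuntimeError("add_to_cart requires an item detail screen")
        self.state["cart"].append(screen.split(":", 1)[1])


def make_cart_verifier(item_id: str) -> Callable[[dict], bool]:
    """Rule-based check: the task succeeds iff the cart holds exactly item_id."""
    def verify(state: dict) -> bool:
        return state["cart"] == [item_id]
    return verify


# Illustrative content; MappingProxyType makes the top-level mapping read-only.
CATALOG = MappingProxyType({
    "sku-1": {"title": "USB-C cable"},
    "sku-2": {"title": "Phone case"},
})

app = MockApp(content=CATALOG)
verify = make_cart_verifier("sku-1")

before = verify(app.state)   # checked before the agent acts
app.tap_item("sku-1")
app.add_to_cart()
after = verify(app.state)    # checked after the rollout
```

Because the verifier inspects only the final environment state, the same check applies to any action sequence that reaches the goal, which is one plausible reason to prefer state-based rules over matching a reference trajectory.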
