2606.12817v1 Jun 11, 2026 cs.AI

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

Jiawei Liu
Xingyu Bruce Liu (UCLA)
Daoyang Liu
Yangfan Luo
Zhilin Gao (Honor Device Co., Ltd.)
T. Kong
Yudong Zhang
Lei Hu
Zuojian Wang

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.
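
To make the abstract's definition of operational knowledge concrete, the following is a minimal sketch of how one step (action type, target UI element, textual argument, execution order) might be represented and rendered as a short sentence. The `OperationStep` schema, its field names, and the sentence template are hypothetical illustrations, not the format used by Teach VLM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OperationStep:
    """One step of operational knowledge (hypothetical schema, not from the paper)."""
    order: int                  # execution order within the demonstration
    action_type: str            # e.g. "tap", "type", "swipe"
    target: str                 # description of the target UI element
    text: Optional[str] = None  # textual argument, if the action takes one

    def to_sentence(self) -> str:
        """Render the step as a short natural-language sentence."""
        if self.text is not None:
            return f'Step {self.order}: {self.action_type} "{self.text}" into {self.target}.'
        return f"Step {self.order}: {self.action_type} {self.target}."


def render_knowledge(steps: List[OperationStep]) -> str:
    """Sort steps by execution order and join them into one readable reference."""
    return "\n".join(s.to_sentence() for s in sorted(steps, key=lambda s: s.order))


# Illustrative steps, deliberately given out of order to show the sorting.
steps = [
    OperationStep(order=2, action_type="type", target="the search box", text="weather"),
    OperationStep(order=1, action_type="tap", target="the search icon"),
]
knowledge = render_knowledge(steps)
print(knowledge)
```

In a Teach-and-Repeat style setup, a string like `knowledge` could be passed to a downstream execution agent as a procedural reference; how the paper actually injects it is not specified in the abstract.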

