2606.11042v1 Jun 09, 2026 cs.AI

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Wangchunshu Zhou
Citations: 1,269
h-index: 21
Zaiyuan Wang
Citations: 262
h-index: 6
Ge Zhang
Citations: 135
h-index: 6
Xinjie Chen
Citations: 185
h-index: 4
Zhixin Yao
Citations: 77
h-index: 3
Wenhao Huang
Citations: 71
h-index: 4
Yuhao Jiang
Citations: 0
h-index: 0
Fangzhi Xu
Citations: 666
h-index: 10
Jiaheng Liu
Citations: 963
h-index: 17
Haodong Duan
Citations: 91
h-index: 5
Liya Zhu
Citations: 233
h-index: 3
Xiang Gao
Citations: 88
h-index: 5
Kaiyuan Zhang
Citations: 54
h-index: 3
Chenchen Zhang
Citations: 55
h-index: 3
Jingzhe Ding
Citations: 64
h-index: 4
Jian Zhang
Citations: 13
h-index: 2
Jian Xue
Citations: 37
h-index: 2
Shihao Liang
Citations: 877
h-index: 6
Yi Zhu
Citations: 7
h-index: 1
Duju Zeng
Citations: 0
h-index: 0
Qingshui Gu
Citations: 105
h-index: 3
M. Gao
Citations: 0
h-index: 0
Huimin Che
Citations: 35
h-index: 2
Yan Zhao
Citations: 14
h-index: 3
Peiheng Zhou
Citations: 141
h-index: 4
Haojun Wang
Citations: 86
h-index: 4
Chao Xian
Citations: 0
h-index: 0
Li Le
Citations: 10
h-index: 1
Chih-Kung Wu
Citations: 1
h-index: 1
Shengda Long
Citations: 30
h-index: 2
Jiale Yang
Citations: 17
h-index: 2
Siji Wu
Citations: 0
h-index: 0
Chaofan He
Citations: 3
h-index: 1
Zhaojian Li
Citations: 701
h-index: 6
Minchao Wang
Citations: 604
h-index: 2
Huan Zhou
Citations: 28
h-index: 2
Jiani Hou
Citations: 4
h-index: 1
Chu Yu
Citations: 12
h-index: 1
Weiran Shi
Citations: 24
h-index: 2
Hongwan Gao
Citations: 42
h-index: 2
Jiamin Chen
Citations: 108
h-index: 5
Guanhong Chen
Citations: 23
h-index: 3
Ting Luo
Citations: 9
h-index: 1
Qin-Wen Hua
Citations: 0
h-index: 0
Jin Chen
Citations: 0
h-index: 0
Pufan Chen
Citations: 10
h-index: 1
Zhenyuan Hu
Citations: 1
h-index: 1
Xingyu Li
Citations: 12
h-index: 2
Zhe Jiang
Citations: 57
h-index: 3
Meng Cao
Citations: 44
h-index: 5
Tianfeng Long
Citations: 13
h-index: 1
Haozhe Wang
Citations: 31
h-index: 2
Mingzhan Wang
Citations: 5
h-index: 2
Yichen Zhang
Citations: 11
h-index: 2
Yi Dai
Citations: 30
h-index: 2
Jiaying Wang
Citations: 110
h-index: 5
Xin-yi Liu
Citations: 0
h-index: 0
Xingzu Liu
Citations: 5
h-index: 2
Lingling Zhang
Citations: 7
h-index: 2
Yujia Qin
Citations: 696
h-index: 7
Zhiyong Wu
Citations: 2,179
h-index: 11
Yang Liu
Citations: 116
h-index: 4
Lei Zhang
Citations: 10
h-index: 1
Shen Yan
Citations: 15
h-index: 1
Xiaolong Chang
Citations: 0
h-index: 0
Yiwei Liu
Citations: 3
h-index: 1

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces (GUIs) to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus predominantly on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to operate domain-specific professional software autonomously and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, showing that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain consistency across long-horizon workflows, frequently exhibiting workflow-stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

0 Citations
0 Influential
10.5 Altmetric
52.5 Score