2606.11042v1 Jun 09, 2026 cs.AI

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Wangchunshu Zhou

Citations: 1,269

h-index: 21

Zaiyuan Wang

Citations: 262

h-index: 6

Ge Zhang

Citations: 135

h-index: 6

Xinjie Chen

Citations: 185

h-index: 4

Zhixin Yao

Citations: 77

h-index: 3

Wenhao Huang

Citations: 71

h-index: 4

Yuhao Jiang

Citations: 0

h-index: 0

Fangzhi Xu

Citations: 666

h-index: 10

Jiaheng Liu

Citations: 963

h-index: 17

Haodong Duan

Citations: 91

h-index: 5

Liya Zhu

Citations: 233

h-index: 3

Xiang Gao

Citations: 88

h-index: 5

Kaiyuan Zhang

Citations: 54

h-index: 3

Chenchen Zhang

Citations: 55

h-index: 3

Jingzhe Ding

Citations: 64

h-index: 4

Jian Zhang

Citations: 13

h-index: 2

Jian Xue

Citations: 37

h-index: 2

Shihao Liang

Citations: 877

h-index: 6

Yi Zhu

Citations: 7

h-index: 1

Duju Zeng

Citations: 0

h-index: 0

Qingshui Gu

Citations: 105

h-index: 3

M. Gao

Citations: 0

h-index: 0

Huimin Che

Citations: 35

h-index: 2

Yan Zhao

Citations: 14

h-index: 3

Peiheng Zhou

Citations: 141

h-index: 4

Haojun Wang

Citations: 86

h-index: 4

Chao Xian

Citations: 0

h-index: 0

Li Le

Citations: 10

h-index: 1

Chih-Kung Wu

Citations: 1

h-index: 1

Shengda Long

Citations: 30

h-index: 2

Jiale Yang

Citations: 17

h-index: 2

Siji Wu

Citations: 0

h-index: 0

Chaofan He

Citations: 3

h-index: 1

Zhaojian Li

Citations: 701

h-index: 6

Minchao Wang

Citations: 604

h-index: 2

Huan Zhou

Citations: 28

h-index: 2

Jiani Hou

Citations: 4

h-index: 1

Chu Yu

Citations: 12

h-index: 1

Weiran Shi

Citations: 24

h-index: 2

Hongwan Gao

Citations: 42

h-index: 2

Jiamin Chen

Citations: 108

h-index: 5

Guanhong Chen

Citations: 23

h-index: 3

Ting Luo

Citations: 9

h-index: 1

Qin-Wen Hua

Citations: 0

h-index: 0

Jin Chen

Citations: 0

h-index: 0

Pufan Chen

Citations: 10

h-index: 1

Zhenyuan Hu

Citations: 1

h-index: 1

Xingyu Li

Citations: 12

h-index: 2

Zhe Jiang

Citations: 57

h-index: 3

Meng Cao

Citations: 44

h-index: 5

Tianfeng Long

Citations: 13

h-index: 1

Haozhe Wang

Citations: 31

h-index: 2

Mingzhan Wang

Citations: 5

h-index: 2

Yichen Zhang

Citations: 11

h-index: 2

Yi Dai

Citations: 30

h-index: 2

Jiaying Wang

Citations: 110

h-index: 5

Xin-yi Liu

Citations: 0

h-index: 0

Xingzu Liu

Citations: 5

h-index: 2

Lingling Zhang

Citations: 7

h-index: 2

Yujia Qin

Citations: 696

h-index: 7

Zhiyong Wu

Citations: 2,179

h-index: 11

Yang Liu

Citations: 116

h-index: 4

Lei Zhang

Citations: 10

h-index: 1

Shen Yan

Citations: 15

h-index: 1

Xiaolong Chang

Citations: 0

h-index: 0

Yiwei Liu

Citations: 3

h-index: 1

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus predominantly on general-purpose software, relatively simple applications, and short-horizon tasks, so it remains largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, showing that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain consistency over long-horizon workflows, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

