2605.25624v1 May 25, 2026 cs.AI

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Shuai Bai
Shixuan Liu
Junyang Lin
Haoyi Hu
Junlin Wang (Duke University)
Tao Yu
Bowen Wang
Dunjie Lu
Tianyi Bai
Zhipeng Zhang
Haiquan Wang
Tianbao Xie
Dayiheng Liu
Quek Shen

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a task instruction, an executable environment, and a verifiable reward that are mutually consistent. Existing sources fall short in complementary ways: hand-curated benchmarks achieve high reward fidelity but cover few applications, whereas LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds based on execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, respectively, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
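The Generator/Discriminator loop described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `TaskTuple`, `orchestrate`, `reward_is_consistent`, and `majority_vote` are hypothetical, environment states are reduced to plain dictionaries, and the execution check (the reward must accept the golden state and reject the initial state) is one plausible reading of how the two agents are reconciled. The agent-rollout part of the final filter is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical simplification: an environment state is a flat dictionary.
State = Dict[str, str]
RewardFn = Callable[[State], bool]


@dataclass
class TaskTuple:
    instruction: str
    initial_state: State
    golden_state: State
    reward_fn: RewardFn


def reward_is_consistent(initial: State, golden: State, reward_fn: RewardFn) -> bool:
    """Assumed execution check: accept the golden state, reject the initial one."""
    return reward_fn(golden) and not reward_fn(initial)


def orchestrate(
    instruction: str,
    generate_states: Callable[[str, str], Tuple[State, State]],
    write_reward: Callable[[str, str], RewardFn],
    max_rounds: int = 3,
) -> Optional[TaskTuple]:
    """Drive the generator and discriminator until they agree on execution."""
    feedback = ""
    for _ in range(max_rounds):
        initial, golden = generate_states(instruction, feedback)
        # The discriminator sees only the task text and feedback, not the states.
        reward_fn = write_reward(instruction, feedback)
        if reward_is_consistent(initial, golden, reward_fn):
            return TaskTuple(instruction, initial, golden, reward_fn)
        feedback = "reward disagreed with the generated states on execution"
    return None  # no agreement within the round budget; drop the task


def majority_vote(votes: List[bool]) -> bool:
    """Keep a tuple only if strictly more than half of the judges approve."""
    return sum(votes) > len(votes) / 2


# Toy stand-ins for the two LLM agents (purely illustrative).
def toy_generate_states(instruction: str, feedback: str) -> Tuple[State, State]:
    return {"title": "draft"}, {"title": "Q3 report"}


def toy_write_reward(instruction: str, feedback: str) -> RewardFn:
    if not feedback:
        # First attempt is too lenient: it also accepts the initial state.
        return lambda state: "title" in state
    return lambda state: state.get("title") == "Q3 report"


task = orchestrate(
    "Rename the document to 'Q3 report'", toy_generate_states, toy_write_reward
)
```

In this toy run the first reward function is rejected because it also passes on the initial state, and the second round produces a stricter check that separates the two states. Keeping the reward writer blind to the generated states mirrors the abstract's point that the Discriminator works from the task specification alone.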
