2606.10394v1 Jun 09, 2026 cs.AI

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

Sirui Liang
Pengfei Cao
Institute of Automation, Chinese Academy of Sciences
Ke Zeng
Xunliang Cai
Jian Zhao
Kang Liu
Bohan Yu
Peiyu Wang
Shiguang Guo
Wenxing Hu
Cao Liu
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences

Large language models increasingly power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task together with its environment, task prompts, ground truth, and verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than by the textual response alone. Using STAGE-Claw, this paper creates a benchmark of 40 challenging real-scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.
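
To make the idea of scoring the final system state concrete, here is a minimal sketch in Python. It is not STAGE-Claw's verification code: the function names (verify_final_state, demo), the file-based notion of "state", and the partial-credit rule are all assumptions chosen for illustration. It contrasts a naive check on the agent's textual reply with a check on what actually exists on disk after the run.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def verify_final_state(workdir: Path, expected: dict[str, str]) -> float:
    """Return the fraction of expected files whose final contents match.

    Hypothetical verifier: 'state' here is just file contents under workdir.
    """
    if not expected:
        return 0.0
    matched = 0
    for rel_path, want in expected.items():
        target = workdir / rel_path
        if target.is_file() and target.read_text().strip() == want.strip():
            matched += 1
    return matched / len(expected)


def demo() -> tuple[float, float]:
    """Score a made-up agent run that claims success but finishes half the task."""
    # Assumed ground truth: two files the task asks the agent to produce.
    expected = {
        "notes/todo.txt": "buy milk",
        "config.json": '{"theme": "dark"}',
    }
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Simulate the agent's side effects: only the first file gets written.
        (root / "notes").mkdir()
        (root / "notes" / "todo.txt").write_text("buy milk\n")
        agent_reply = "Done! Both files are updated."

        # Text-only scoring trusts the reply; state-based scoring inspects disk.
        text_score = 1.0 if "done" in agent_reply.lower() else 0.0
        state_score = verify_final_state(root, expected)
    return text_score, state_score


if __name__ == "__main__":
    text_score, state_score = demo()
    print(f"text-only score: {text_score}, state-based score: {state_score}")
```

In this toy run the reply-based check awards full marks while the state check awards half, which is the gap that state-based evaluation is meant to expose. A real verifier would likely inspect richer state (application data, calendars, messages) than plain files.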


