2605.26086v1 May 25, 2026 cs.AI

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Feiyang Pan
Dandan Tu
Yusong Lin
Haiyang Wang
Shuzhe Wu
Lu Fan
Xinyu Liang
Qi Gu
Siqi Cheng
Jiangui Chen
Sanyuan Zhao

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, which limits context-sensitive reasoning and effective assistance. Existing benchmarks likewise expose only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over these rich contextual environments while remaining robust to such noise. The expanded scope also enables the evaluation of proactive assistance, which requires agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially lower than scores reported on prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility as scalable data infrastructure.
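
For readers unfamiliar with the pass@1 figure quoted above, the following is a minimal sketch of how pass@k is commonly estimated from repeated attempts, using the standard unbiased estimator 1 - C(n-c, k)/C(n, k). The abstract does not say how Claw-Anything computes its score, so the function names and the per-task `(n, c)` input format here are illustrative assumptions, not the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of attempts sampled for the task
    c: number of those attempts that passed
    k: budget of attempts being scored (k = 1 gives pass@1 = c / n)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks; `results` holds hypothetical (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single attempt per task, pass@1 reduces to the fraction of tasks solved, which is the usual reading of a headline number such as 34.5%.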
