2605.27141v1 May 26, 2026 cs.AI

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Yaorui Shi
Yuxin Chen
Yaqi Huo
Tat-Seng Chua
Jingnan Zheng
Chenhang Cui
Zhengzhou Cai
Y. Zhang
Xunliang Cai
Zhiyuan Yao
Qi Gu
An Zhang
Xiaohan Su
Xiang Wang

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
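
To make the idea of an extensible memory interface concrete, the following is a minimal sketch of how temporally ordered interactions might be fed through interchangeable memory backends. It is not taken from VitaBench 2.0: the names `Interaction`, `Memory`, `FullHistoryMemory`, `KeywordMemory`, and `run_user_sequence`, as well as the data schema, are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Interaction:
    """One fragment of a user's history (hypothetical schema)."""
    timestamp: int
    text: str


class Memory(ABC):
    """Hypothetical pluggable memory backend, not the benchmark's actual API."""

    @abstractmethod
    def update(self, interaction: Interaction) -> None:
        """Store or digest a new interaction."""

    @abstractmethod
    def retrieve(self, query: str) -> list[str]:
        """Return stored texts considered relevant to the query."""


class FullHistoryMemory(Memory):
    """Baseline: keeps everything and returns the whole log."""

    def __init__(self) -> None:
        self._log: list[str] = []

    def update(self, interaction: Interaction) -> None:
        self._log.append(interaction.text)

    def retrieve(self, query: str) -> list[str]:
        return list(self._log)


class KeywordMemory(Memory):
    """Returns only stored texts that share at least one word with the query."""

    def __init__(self) -> None:
        self._log: list[str] = []

    def update(self, interaction: Interaction) -> None:
        self._log.append(interaction.text)

    def retrieve(self, query: str) -> list[str]:
        words = set(query.lower().split())
        return [t for t in self._log if words & set(t.lower().split())]


def run_user_sequence(memory: Memory, interactions: list[Interaction], query: str) -> list[str]:
    """Replay interactions in temporal order, then query the memory."""
    for item in sorted(interactions, key=lambda i: i.timestamp):
        memory.update(item)
    return memory.retrieve(query)
```

Because both backends implement the same two methods, the same user sequence and query can be replayed against each one, which is the kind of controlled comparison the abstract describes. A real memory architecture would replace the keyword match with something stronger, such as summarization or embedding retrieval.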

