2606.13608v1 Jun 11, 2026 cs.AI

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Jianhong Tu
Elron Bandel
IBM Research
Xi Zhang
Alexandre Lacoste
Victor Barres
Tianneng Shi
Ramayya Krishnan
Donghyun Lee
Siva Reddy
Yue Su
Wenbo Guo
Michal Shmueli-Scheuer
Xiaoyuan Liu
Gal Gantar
Evan Sandoval
Daniela Miao
P. Gilbert
Nicholas Hynes
Mauro Staver
Warren He
David Marn
Andrew Low
Alexandre Drouin
Elham Tabassi
Yuqi Chen
Siyu Xie
Sihan Ren
Chenguang Wang
D. Song

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
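
The single-interface idea in the abstract can be illustrated with a toy sketch. Everything here is hypothetical and in-process: `SubjectAgent`, `handle_task`, `JudgeAgent`, and the two example agents are invented names, and no real A2A or MCP calls are made. The sketch only shows the separation the abstract describes, where the judge holds the assessment logic and any agent exposing the shared interface can be scored without a custom harness.

```python
from dataclasses import dataclass
from typing import Protocol


class SubjectAgent(Protocol):
    """Hypothetical stand-in for an agent reachable over a task protocol such as A2A."""

    name: str

    def handle_task(self, task: str) -> str: ...


@dataclass
class EchoAgent:
    name: str

    def handle_task(self, task: str) -> str:
        return task  # returns the task text unchanged


@dataclass
class UpperAgent:
    name: str

    def handle_task(self, task: str) -> str:
        return task.upper()  # returns the task text in upper case


@dataclass
class JudgeAgent:
    """Owns the assessment logic: a list of (task, expected answer) pairs."""

    cases: list[tuple[str, str]]

    def assess(self, subject: SubjectAgent) -> float:
        # The judge only uses the shared interface, never agent internals.
        passed = sum(
            subject.handle_task(task) == expected for task, expected in self.cases
        )
        return passed / len(self.cases)


judge = JudgeAgent(cases=[("ping", "PING"), ("ok", "OK")])
scores = {
    agent.name: judge.assess(agent)
    for agent in (EchoAgent("echo"), UpperAgent("upper"))
}
print(scores)  # {'echo': 0.0, 'upper': 1.0}
```

Swapping in a different judge changes the benchmark without touching either agent, and adding a new agent requires no change to the judge, which is the decoupling the abstract attributes to AAA.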
