2606.13608v1 Jun 11, 2026 cs.AI

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Jianhong Tu

Citations: 21

h-index: 2

Elron Bandel

IBM Research

Citations: 332

h-index: 9

Xi Zhang

Citations: 5,504

h-index: 10

Alexandre Lacoste

Citations: 723

h-index: 11

Victor Barres

Citations: 308

h-index: 4

Tianneng Shi

Citations: 506

h-index: 11

Ramayya Krishnan

Citations: 10

h-index: 1

Donghyun Lee

Citations: 161

h-index: 3

Siva Reddy

Citations: 3

h-index: 1

Yue Su

Citations: 17

h-index: 2

Wenbo Guo

Citations: 288

h-index: 10

Michal Shmueli-Scheuer

Citations: 1,031

h-index: 17

Xiaoyuan Liu

Citations: 790

h-index: 8

Gal Gantar

Citations: 0

h-index: 0

Evan Sandoval

Citations: 0

h-index: 0

Daniela Miao

Citations: 93

h-index: 2

P. Gilbert

Citations: 3,680

h-index: 11

Nicholas Hynes

Citations: 1,395

h-index: 7

Mauro Staver

Citations: 0

h-index: 0

Warren He

Citations: 3,854

h-index: 16

David Marn

Citations: 150

h-index: 5

Andrew Low

Citations: 184

h-index: 3

Alexandre Drouin

Citations: 925

h-index: 11

Elham Tabassi

Citations: 4

h-index: 1

Yuqi Chen

Citations: 2

h-index: 1

Siyu Xie

Citations: 25

h-index: 4

Sihan Ren

Citations: 2

h-index: 1

Chenguang Wang

Citations: 319

h-index: 4

D. Song

Citations: 109

h-index: 6

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
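The single-interface idea in the abstract can be illustrated with a minimal sketch. This is not AgentBeats code and does not use the real A2A or MCP libraries: `SubjectAgent`, `handle_task`, `JudgeAgent`, and `EchoAgent` are hypothetical names, and an in-process method call stands in for the protocol exchange. It only shows how a judge agent might hold the assessment logic while depending on nothing but one shared interface.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class SubjectAgent(Protocol):
    """Hypothetical single interface that every agent under test exposes."""

    def handle_task(self, task: str) -> str: ...


@dataclass
class EchoAgent:
    """Toy subject agent (assumption): replies with the task text upper-cased."""

    def handle_task(self, task: str) -> str:
        return task.upper()


@dataclass
class JudgeAgent:
    """Toy judge agent (assumption): owns the tasks and the scoring rule."""

    tasks: list[str]
    check: Callable[[str, str], bool]

    def assess(self, subject: SubjectAgent) -> float:
        # The judge sees only the shared interface, never agent internals.
        passed = sum(self.check(t, subject.handle_task(t)) for t in self.tasks)
        return passed / len(self.tasks)


judge = JudgeAgent(tasks=["ping", "hello"], check=lambda t, r: r == t.upper())
score = judge.assess(EchoAgent())
print(score)
```

Because the judge relies only on `handle_task`, any other agent exposing the same method could be assessed by the same judge without changes, which is the separation of assessment logic from agent implementation that the abstract describes.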

