2606.05670v1 Jun 04, 2026 cs.AI

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Jiaqi Shao
Bing Luo
Yuhan Fu
Ruishan Fang
Tao Lin
Huiyuan Zheng
Zhengtao Zhu

Does adding more agents help an LLM workflow once the compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent system (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal (SI) workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of the six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56 to 11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
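The abstract relies on two statistical notions without defining them: a benchmark-balanced average accuracy and a Wilson-based tolerance for single-run results. The sketch below illustrates one plausible reading of each. It assumes that "benchmark-balanced" means an unweighted mean of per-benchmark accuracies and that the "Wilson one-run guidance" is derived from the standard Wilson score interval for a binomial proportion. The function names `wilson_interval` and `balanced_average` and the sample counts are hypothetical and are not taken from BenchAgent.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%).

    Assumption: this is the textbook interval; the paper's exact
    "one-run guidance" rule may differ.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def balanced_average(per_benchmark: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-benchmark accuracies.

    Each value is (correct, total). Every benchmark counts equally,
    regardless of how many items it contains (assumed meaning of
    "benchmark-balanced").
    """
    if not per_benchmark:
        raise ValueError("need at least one benchmark")
    accs = [correct / total for correct, total in per_benchmark.values()]
    return sum(accs) / len(accs)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    low, high = wilson_interval(50, 100)
    print(f"Wilson 95% interval for 50/100: [{low:.4f}, {high:.4f}]")
    print(balanced_average({"small": (1, 2), "large": (3, 4)}))
```

Under this reading, a multi-agent result whose accuracy falls inside the single-agent anchor's interval would not count as a clear win or loss from one run, and a large benchmark cannot dominate the average the way it would in a pooled (item-weighted) accuracy.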
