2604.22436v1 Apr 24, 2026 cs.AI

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Bin Wu

University College London

Arastun Mammadli

Xiaoyu Zhang

Emine Yilmaz

Original Abstract

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
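The two-stage setup described in the abstract (retrieval over agent descriptions, then reranking with an execution signal) can be illustrated with a minimal sketch. This is not the benchmark's code or the authors' method: the `Agent` fields, the token-overlap score standing in for semantic similarity, the `probe_success` values, and the `alpha` blending weight are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    description: str
    # Hypothetical execution signal: fraction of small probe tasks solved, in [0, 1].
    probe_success: float


def description_score(query: str, description: str) -> float:
    """Jaccard overlap of lowercase tokens; a crude stand-in for embedding similarity."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def retrieve(query: str, agents: list[Agent], k: int) -> list[Agent]:
    """Stage 1: keep the top-k agents ranked by description score alone."""
    ranked = sorted(
        agents,
        key=lambda a: description_score(query, a.description),
        reverse=True,
    )
    return ranked[:k]


def rerank(query: str, candidates: list[Agent], alpha: float = 0.5) -> list[Agent]:
    """Stage 2: blend the description score with the execution signal."""
    def blended(a: Agent) -> float:
        text = description_score(query, a.description)
        return alpha * text + (1 - alpha) * a.probe_success

    return sorted(candidates, key=blended, reverse=True)


# Toy candidate pool (invented data, not drawn from the benchmark).
AGENTS = [
    Agent("csv-wizard", "summarize csv files", 0.1),
    Agent("data-analyst", "analyze tabular data and report", 0.9),
    Agent("poem-bot", "write poems about nature", 0.8),
]

QUERY = "summarize csv sales data"
by_description = retrieve(QUERY, AGENTS, k=2)
by_blend = rerank(QUERY, by_description, alpha=0.5)

print([a.name for a in by_description])
print([a.name for a in by_blend])
```

In this toy pool the agent whose description best matches the query performs poorly on the probe tasks, so blending in the execution signal moves the better-performing agent to the top. The example only illustrates the kind of gap between textual similarity and actual performance that the abstract reports; it says nothing about its size in practice.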
