2604.09251v1 Apr 10, 2026 cs.AI

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

R. Astudillo

Citations: 50

h-index: 4

Young-Suk Lee

Citations: 287

h-index: 9

Radu Florian

Citations: 95

h-index: 6

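Recent deep research agents interleave web browsing with multi-step computation, but existing benchmarks evaluate these capabilities separately, which makes real-world performance hard to assess. This paper introduces DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. DRBENCHER enforces four criteria: (1) verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), (2) complexity (multi-hop entity identification, property retrieval, and domain-specific computation), (3) difficulty (a two-stage verification cascade removes questions that the generating model can solve), and (4) diversity (a greedy max-min embedding filter maximizes coverage). These criteria are implemented through a unified answer-first pipeline covering five domains: biochemistry, finance, geophysics, security, and history. Human evaluation found 76% validity (84% when stale data is excluded), with 35% of errors caused by outdated knowledge-graph entries, an inherent limitation of systems that reason over changing data. In automatic evaluation, the strongest frontier model reached only 20% answer accuracy. Compared with manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.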

Original Abstract

Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
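The abstract names a "greedy max-min embedding filter" without giving details. As a rough illustration only, the sketch below shows the standard farthest-point heuristic that this phrase usually refers to: start from one item, then repeatedly add the item whose nearest already-selected neighbor is farthest away. The function names, the use of cosine distance, the seed choice, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_max_min_select(embeddings, k, seed_index=0):
    """Pick up to k indices by greedy max-min (farthest-point) selection.

    Hypothetical helper: each step adds the candidate whose distance to
    its closest already-selected item is largest.
    """
    if not embeddings or k <= 0:
        return []
    selected = [seed_index]
    # min_dist[i] is the distance from item i to its nearest selected item.
    min_dist = [cosine_distance(e, embeddings[seed_index]) for e in embeddings]
    target = min(k, len(embeddings))
    while len(selected) < target:
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        best = max(candidates, key=lambda i: min_dist[i])
        selected.append(best)
        # Refresh nearest-selected distances with the newly added item.
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[best]))
    return selected


# Toy 2-D "embeddings": items 0 and 1 are near-duplicates.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(greedy_max_min_select(toy_embeddings, 3))  # prints [0, 2, 3]
```

In the toy run, the near-duplicate item 1 is chosen last, which is the behavior a diversity filter of this kind is meant to produce. A real pipeline would use sentence embeddings of the generated questions and might pick the seed or the stopping rule differently.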

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
