2601.20617v1 Jan 28, 2026 cs.CY

Agent Benchmarks Fail Public Sector Requirements

Jan Batzner

Citations: 156

h-index: 7

Jonathan Rystrøm

Citations: 242

h-index: 5

Chris Schmitz

Citations: 66

h-index: 2

Karolina Korgul

Citations: 73

h-index: 3

Chris Russell

Citations: 12

h-index: 2

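Deploying agents based on large language models (LLMs) in the public sector requires verifying that they meet the stringent legal, procedural, and structural requirements of public institutions. Practitioners and researchers often rely on benchmarks for such assessments. However, it remains unclear what criteria a benchmark must meet to adequately reflect public-sector requirements, and how many current benchmarks meet them. In this paper, we define these criteria based on the public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. We analysed more than 1,300 benchmark papers using an expert-validated, LLM-based system. The analysis shows that no single benchmark meets all of the criteria. These findings call on researchers to develop public sector-relevant benchmarks and on public-sector officials to apply these criteria when evaluating agentic use cases.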

Original Abstract

Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, public-sector-specific and report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action for both researchers to develop public sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.

1 Citations

0 Influential

3.5 Altmetric

18.5 Score
