2603.15483v1 Mar 16, 2026 cs.AI

Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Jiyuan Shen (Citations: 4; h-index: 1)
A. Ghosh (Citations: 115; h-index: 2)
Yifan Mai, Stanford University (Citations: 2,491; h-index: 15)
Daniel Dahlmeier (Citations: 9; h-index: 2)
Penny Chong (Citations: 1; h-index: 1)
H. Abichandani (Citations: 3; h-index: 1)
Min Pyae Moe (Citations: 1; h-index: 1)

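Agent applications are increasingly used to automate workflows across a wide range of tasks. Because the domains in which agents operate are heterogeneous, however, building a scalable evaluation framework is difficult. Prior work determines task success with a variety of methods, such as database lookups and regex matching, which complicates the development of a unified approach to agent evaluation. Prior work also does not systematically account for the user's role and expertise in the interaction, and so gives an incomplete picture of agent performance. We argue that effective agent evaluation must cover not only correctness but also conversation quality, efficiency, and systematic diagnosis of agent errors. To this end, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: we simulate user-agent interactions using reusable, generic expert and non-expert user persona templates. (2) Evaluate: we adapt existing datasets by expressing subgoals (such as tool signatures and responses) as natural-language grading notes, which an LLM (large language model) judge evaluates automatically. We propose new metrics that complement the user-aware setup by measuring both the agent's turn efficiency and its intermediate progress. (3) Diagnose: we introduce an automated error analysis tool that analyzes inconsistencies of the judge and agents to identify common errors and provide actionable feedback for improving the agent. We show that the TED framework yields new insights into agent performance across models and user expertise levels. We also show that incorporating the identified error remedies into the agent's design can produce gains peaking at 8-10% on the proposed metrics.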

Original Abstract

Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups and regex matching, adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals (such as tool signatures and responses) as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance, with peaks of 8-10% on our proposed metrics, after incorporating the identified error remedies into the agent's design.
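
The abstract does not define the grading-note format or the proposed metrics, so the following is only a rough sketch of the general idea in the Evaluate step: subgoals written as natural-language notes, each checked against the conversation by a judge, with one score for intermediate progress and one for turn efficiency. Every name here (`GradingNote`, `keyword_judge`, `progress`, `turns_to_completion`) is hypothetical, the keyword check is a stand-in for an LLM-as-a-judge call, and the two scores are illustrative simplifications, not the paper's metrics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GradingNote:
    """One subgoal written as a natural-language note (hypothetical structure)."""
    text: str     # what a real LLM judge would read
    keyword: str  # used only by the stub judge below


# A judge receives a note and the agent turns so far, and says whether the note is met.
Judge = Callable[[GradingNote, List[str]], bool]


def keyword_judge(note: GradingNote, agent_turns: List[str]) -> bool:
    """Stand-in for an LLM-as-a-judge call: a simple case-insensitive keyword check."""
    return any(note.keyword.lower() in turn.lower() for turn in agent_turns)


def first_satisfied_turn(note: GradingNote, agent_turns: List[str], judge: Judge) -> Optional[int]:
    """Return the 1-based turn at which the note is first judged satisfied, or None."""
    for i in range(1, len(agent_turns) + 1):
        if judge(note, agent_turns[:i]):
            return i
    return None


def progress(notes: List[GradingNote], agent_turns: List[str], judge: Judge) -> float:
    """Fraction of grading notes satisfied (an assumed proxy for intermediate progress)."""
    if not notes:
        return 0.0
    met = sum(first_satisfied_turn(n, agent_turns, judge) is not None for n in notes)
    return met / len(notes)


def turns_to_completion(notes: List[GradingNote], agent_turns: List[str], judge: Judge) -> Optional[int]:
    """Turn at which the last note is satisfied (an assumed proxy for turn efficiency)."""
    turns = [first_satisfied_turn(n, agent_turns, judge) for n in notes]
    if not turns or any(t is None for t in turns):
        return None  # task not completed within the transcript
    return max(turns)


# Made-up example data, for illustration only.
notes = [
    GradingNote("Agent calls search_flights with the requested origin", "search_flights"),
    GradingNote("Agent confirms the booking to the user", "confirmed"),
]
agent_turns = [
    "Calling search_flights(origin='SFO', dest='JFK')",
    "I found three options. Which one do you prefer?",
    "Your booking is confirmed.",
]

print(progress(notes, agent_turns[:2], keyword_judge))        # partial conversation
print(progress(notes, agent_turns, keyword_judge))            # full conversation
print(turns_to_completion(notes, agent_turns, keyword_judge))
```

Scoring per note rather than per task is what lets a partially successful conversation receive partial credit, and recording the turn at which each note is first met gives a simple handle on efficiency. A real setup would replace `keyword_judge` with a model call that reads `note.text` against the transcript.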
