2601.11854v2 Jan 17, 2026 cs.CL

ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

Yujia Liu

Citations: 0

h-index: 0

Dilek Hakkani-Tur

Citations: 703

h-index: 12

Gokhan Tur

University of Illinois at Urbana Champaign

Citations: 3,583

h-index: 23

H. Nayyeri

Citations: 3,039

h-index: 30

R. Khaziev

Citations: 100

h-index: 5

Emine Yilmaz

Citations: 4

h-index: 1

Hari Thadakamalla

Citations: 3

h-index: 1

Abstract

Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.
