2603.04370v1 Mar 04, 2026 cs.AI

$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Alexandra Zytek

Citations: 265

h-index: 8

P. Razavi

Citations: 22

h-index: 3

Karthik R. Narasimhan

Citations: 389

h-index: 5

Victor Barres

Citations: 308

h-index: 4

Quan Shi

Citations: 307

h-index: 4

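Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, unstructured data during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use in isolation, leaving a gap in realistic agent evaluation over unstructured data in long-horizon interactions. This work proposes τ-Knowledge, an extended benchmark for evaluating agents in environments where they must coordinate external natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our newly developed environment, τ-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while using tools to perform account updates. Across both embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets reach a success rate of only about 25.5%, and reliability degrades sharply over repeated attempts. Agents struggle to retrieve the correct documents from a densely interlinked knowledge base and to reason accurately over complex internal policies. Overall, τ-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing environments.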

Original Abstract

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $τ$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
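The abstract reports pass^1 and notes that reliability degrades over repeated trials, but this page does not define the metric. The sketch below assumes the pass^k convention associated with $τ$-Bench, where a task counts toward pass^k only if all k trials drawn from its n recorded trials succeed, estimated per task as C(c, k) / C(n, k) for c successes. The function names and the per-task success counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed.

    Assumed estimator: C(c, k) / C(n, k), with n trials and c successes.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    # comb(c, k) is 0 when c < k, so tasks with too few successes score 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    if not successes_per_task:
        raise ValueError("at least one task is required")
    per_task = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical results: successes out of 4 trials for each of 4 tasks.
    successes = [4, 2, 0, 1]
    for k in range(1, 5):
        print(f"pass^{k} = {benchmark_pass_hat_k(successes, num_trials=4, k=k):.4f}")
```

Under this assumed definition, pass^1 is the ordinary average success rate, and pass^k can only stay level or fall as k grows, because a task must succeed on every one of the k sampled trials. That is one way to read "reliability degrading sharply over repeated trials".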

