2603.13686v1 Mar 14, 2026 cs.SD

$τ$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

Karthik R. Narasimhan

Citations: 389

h-index: 5

Victor Barres

Citations: 308

h-index: 4

Soham Ray

Citations: 280

h-index: 3

Keshav Dhandhania

Citations: 18

h-index: 3

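Full-duplex voice agents, systems that listen and speak at the same time, are moving quickly from research to production. Existing evaluations, however, treat conversational dynamics and task completion separately. We introduce τ-voice, a benchmark for evaluating voice agents on grounded tasks that reflect real-world complexity: agents must carry out complex multi-turn conversations, follow domain-specific policies, and interact with the environment. The framework extends τ²-bench into a new voice agent benchmark that combines verifiable completion of complex tasks, full-duplex interaction, and realistic audio, which allows voice and text performance to be compared directly. A controllable, realistic voice user simulator supplies diverse accents, realistic audio environments, and rich turn-taking dynamics. Because the simulation is decoupled from wall-clock time, the user simulator can use the most capable LLM without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks. GPT-5 (reasoning) achieves 85%, whereas voice agents reach 31-51% under clean conditions and 26-38% under realistic conditions with noise and diverse accents, retaining only 30-45% of text capability. Qualitative analysis attributes 79-90% of failures to agent behavior, suggesting that the observed failures mainly reflect how agents behave under the evaluation setup. τ-voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.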

Original Abstract

Full-duplex voice agents (systems that listen and speak simultaneously) are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $τ$-voice, a benchmark for evaluating voice agents on grounded tasks with real-world complexity: agents must navigate complex multi-turn conversations, adhere to domain policies, and interact with the environment. The framework extends $τ^2$-bench into a novel voice agent benchmark combining verifiable completion of complex grounded tasks, full-duplex interaction, and realistic audio, enabling direct comparison between voice and text performance. A controllable and realistic voice user simulator provides diverse accents, realistic audio environments, and rich turn-taking dynamics; by decoupling simulation from wall-clock time, the user simulator can use the most capable LLM without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks: while GPT-5 (reasoning) achieves 85%, voice agents reach only 31-51% under clean conditions and 26-38% under realistic conditions with noise and diverse accents, retaining only 30-45% of text capability; qualitative analysis confirms 79-90% of failures stem from agent behavior, suggesting that observed failures primarily reflect agent behavior under our evaluation setup. $τ$-voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.
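
The abstract does not say which voice condition the "30-45% of text capability" figure refers to. A quick check, sketched below, divides each reported voice range by the 85% text score. The function names `pass_at_1` and `retention` are illustrative and not from the paper; `pass_at_1` assumes the usual reading of pass@1 as the fraction of tasks solved in a single attempt, and pairing range endpoints with the single text score is also an assumption.

```python
# Illustrative only: the numbers come from the abstract; the function names
# and the way the ranges are paired with the text score are assumptions.

def pass_at_1(outcomes):
    """Fraction of tasks solved on a single attempt (one bool per task)."""
    return sum(outcomes) / len(outcomes)

def retention(voice_rate, text_rate):
    """Share of the text agent's success rate that a voice agent keeps."""
    return voice_rate / text_rate

TEXT = 0.85                # GPT-5 (reasoning), as reported
CLEAN = (0.31, 0.51)       # voice agents, clean conditions
REALISTIC = (0.26, 0.38)   # voice agents, noise and diverse accents

for label, (low, high) in (("clean", CLEAN), ("realistic", REALISTIC)):
    print(f"{label}: {retention(low, TEXT):.0%} to {retention(high, TEXT):.0%}")
```

Under these assumptions the realistic-condition range comes out at roughly 31% to 45%, close to the reported 30-45%, while the clean-condition range is about 36% to 60%. This suggests, but does not confirm, that the retention figure describes the realistic setting.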

4 Citations

0 Influential

2.5 Altmetric

16.5 Score
