2601.17087v2 Jan 23, 2026 cs.HC

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

P. Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, Seraphina Goldfarb-Tarrant

Abstract

Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE) speakers, with disparities compounding significantly with age. We also find simulated users to be a differentially effective proxy for different populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patterns than human users. These findings demonstrate that current evaluation practices risk misrepresenting agent capabilities across diverse user populations and may obscure real-world deployment challenges.
