2605.07699v1 May 08, 2026 cs.CL

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

Hsuvas Borkakoty

Sebastian Pohl

Cheng Wang

Bei Chen

Yufang Hou

Original Abstract

LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.
