2602.00456v1 Jan 31, 2026 cs.AI

Benchmarking Agents in Insurance Underwriting Environments

Bhavishya Pohani (Citations: 0, h-index: 0)

Christopher M Glaze (Citations: 7, h-index: 1)

A. Dsouza (Citations: 2,633, h-index: 5)

R. Ramakrishnan (Citations: 2, h-index: 1)

Charles Dickens (Citations: 80, h-index: 2)

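As AI agents are adopted in enterprise applications, benchmarks that reflect the complexity of real-world work are needed. Existing benchmarks, however, concentrate too heavily on open domains such as code, rely on narrow accuracy metrics, and fail to capture the complexity of practical work. We therefore propose UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture the hard problems of real enterprise settings. UNDERWRITE includes key realism factors that existing benchmarks often overlook: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users who require careful information gathering. Evaluating 13 frontier models, we found a substantial gap between performance in research lab settings and readiness for real enterprise deployment. The most accurate models were not the most efficient, models hallucinated domain knowledge even though tools were available, and the pass^k metric showed a 20% drop in performance. The results from UNDERWRITE suggest that expert involvement is essential to designing benchmarks for realistic agent evaluation, that common agentic frameworks carry a brittleness that can distort performance reporting, and that hallucination detection in specialized domains requires compositional approaches. This work provides the insights needed to develop benchmarks that better match enterprise deployment requirements.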

Original Abstract

As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real-world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not the most efficient, models hallucinate domain knowledge despite tool access, and pass^k results show a 20% drop in performance. The results from UNDERWRITE demonstrate that expert involvement in benchmark design is essential for realistic agent evaluation, common agentic frameworks exhibit brittleness that skews performance reporting, and hallucination detection in specialized domains demands compositional approaches. Our work provides insights for developing benchmarks that better align with enterprise deployment requirements.
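
The abstract reports pass^k results but does not define the metric on this page. The sketch below assumes the definition commonly used in agent benchmarks: pass^k is the probability that all k independent trials of the same task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the toy results are hypothetical and are not taken from the paper; whether UNDERWRITE uses exactly this estimator is an assumption.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k i.i.d. trials of one task all succeed.

    Assumed estimator: C(c, k) / C(n, k) for c successes in n trials.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must satisfy 0 <= c <= num_trials")
    # math.comb returns 0 when k > num_successes, so a task with fewer
    # than k successes contributes 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_hat_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: three tasks, four trials each.
example_results = {
    "task_a": [True, True, True, True],      # always succeeds
    "task_b": [True, False, True, True],     # succeeds 3 times out of 4
    "task_c": [False, False, False, False],  # never succeeds
}

for k in (1, 2, 4):
    print(f"pass^{k} = {benchmark_pass_hat_k(example_results, k):.3f}")
```

With k = 1 the estimate reduces to the average per-trial success rate; as k grows, only tasks that succeed consistently keep contributing, which is why a pass^k curve can fall well below single-trial accuracy.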
