2604.09937v1 Apr 10, 2026 cs.AI

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

Oluwasanmi Koyejo

Citations: 19,806

h-index: 42

Suhana Bedi

Citations: 771

h-index: 10

Ryan Welch

Citations: 17

h-index: 3

E. Steinberg

Citations: 1,707

h-index: 18

Michael Wornow

Citations: 1,492

h-index: 15

Taeil Kim

Citations: 14

h-index: 3

Peter V. Sterling

Citations: 105

h-index: 3

Bravim K. Purohit

Citations: 3

h-index: 1

Qurat-Ul-Ain Akram

Citations: 4

h-index: 1

Angelic Acosta

Citations: 3

h-index: 1

Esther Nubla

Citations: 3

h-index: 1

M. Pfeffer

Citations: 76

h-index: 3

H. Ahmed

Citations: 26

h-index: 3

Priti Sharma

Citations: 3

h-index: 1

Nigam H. Shah

Citations: 207

h-index: 7

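Healthcare administration accounts for more than $1 trillion in annual spending, which makes it a high-potential target for LLM-based computer-use agents (CUAs). Research on clinical applications of LLMs is active, but no benchmark exists for evaluating CUAs on end-to-end healthcare administration workflows. To close this gap, this work introduces HealthAdminBench, a benchmark consisting of four realistic GUI environments (an electronic health record system, two payer portals, and a fax system) and 135 expert-defined tasks covering three types of administrative work: prior authorization, appeals and denials management, and durable medical equipment (DME) order processing. Each task is broken down into fine-grained, verifiable subtasks, giving 1,698 evaluation points in total. An evaluation of seven agent configurations across multiple prompting and observation settings shows that subtask performance is strong but end-to-end reliability remains low. The best-performing agent (Claude Opus 4.6 CUA) reached a task success rate of 36.3%, while GPT-5.4 CUA achieved the highest subtask success rate (82.8%). These results show a substantial gap between current agent capabilities and the demands of real healthcare administration workflows. HealthAdminBench provides a rigorous basis for measuring progress toward safe and reliable automation of those workflows.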

Original Abstract

Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
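The gap between a high subtask success rate and a low task success rate is easier to see with a small worked example. The sketch below is illustrative only: it assumes a task counts as successful only when every one of its subtasks passes, which the abstract does not state explicitly, and the `TaskResult` type, function names, and sample data are hypothetical rather than taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task run: one pass/fail flag per subtask."""
    task_id: str
    subtask_passed: list[bool]


def subtask_success_rate(results: list[TaskResult]) -> float:
    """Fraction of all subtask checks that passed, pooled across tasks."""
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total if total else 0.0


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose subtasks ALL passed (assumed scoring rule)."""
    if not results:
        return 0.0
    solved = sum(bool(r.subtask_passed) and all(r.subtask_passed) for r in results)
    return solved / len(results)


# Made-up data: three tasks with ten subtasks each.
results = [
    TaskResult("task-a", [True] * 10),               # fully completed
    TaskResult("task-b", [True] * 9 + [False]),      # one subtask missed
    TaskResult("task-c", [True] * 8 + [False] * 2),  # two subtasks missed
]

print(f"subtask success: {subtask_success_rate(results):.1%}")
print(f"task success:    {task_success_rate(results):.1%}")
```

In this toy data, 27 of 30 subtasks pass but only one of three tasks is fully completed, which mirrors the pattern the abstract reports. For scale, 1,698 evaluation points over 135 tasks is about 12.6 subtasks per task on average, and 36.3 percent of 135 tasks corresponds to 49 tasks.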

3 Citations

1 Influential

21 Altmetric

110.0 Score
