2605.14543v1 May 14, 2026 cs.LG

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

James T. Kwok

Citations: 2,318

h-index: 13

Shuhao Chen

Citations: 129

h-index: 4

Weisen Jiang

HKUST

Citations: 1,154

h-index: 12

Changmiao Wang

Citations: 113

h-index: 5

Xiaoqing Wu

Citations: 4

h-index: 1

Xuanren Shi

Citations: 79

h-index: 6

Yu Zhang

Citations: 990

h-index: 9

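Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes of administration as a patient's condition changes. Existing benchmarks frame the task as an admission-level prediction problem, use coarse drug codes, and rely on diagnosis and procedure codes as input, so they fail to capture the per-timepoint, patient-specific information of real prescribing. This work proposes RxEval, a prescription-level benchmark that evaluates the prescribing capability of LLMs through multiple-choice questions. Each question provides a detailed patient profile and time-ordered clinical information, and its options combine specific medication-dose-route triples drawn from real prescription data with patient-specific distractors generated by perturbing the reasoning chain. RxEval consists of 1,547 questions covering 584 patients, 18 diagnostic categories, and 969 unique medications. An evaluation of 16 LLMs shows that RxEval is both difficult and discriminative: F1 scores range from 45.18 to 77.10, and the best Exact Match is only 46.10%. Error analysis shows that even frontier models overlook explicitly stated patient information or struggle to derive clinical conclusions.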

Original Abstract

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
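The gap between the F1 range and the best Exact Match is easier to read with the two metrics side by side. The abstract does not define them, so the following is only a minimal sketch under stated assumptions: each question is treated as multi-select, F1 is computed per question over the sets of chosen and correct option labels and then averaged, and Exact Match requires the two sets to be identical. The function names and option labels are hypothetical, not taken from the paper.

```python
from typing import Iterable, Sequence


def question_scores(predicted: Iterable[str], gold: Iterable[str]) -> tuple[float, bool]:
    """Return (set-based F1, exact match) for one multi-select question.

    Assumption: options are identified by labels such as "A", "B", ...
    """
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    if true_pos == 0:
        # No overlap: F1 is 0; exact match holds only if both sets are empty.
        return 0.0, pred == ref
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, pred == ref


def benchmark_scores(pairs: Sequence[tuple[Iterable[str], Iterable[str]]]) -> tuple[float, float]:
    """Average per-question F1 and Exact Match, both reported as percentages."""
    if not pairs:
        raise ValueError("at least one question is required")
    scores = [question_scores(pred, gold) for pred, gold in pairs]
    mean_f1 = 100 * sum(f1 for f1, _ in scores) / len(scores)
    exact_match = 100 * sum(1 for _, em in scores if em) / len(scores)
    return mean_f1, exact_match


# Hypothetical answers: the first question is fully correct, the second
# recovers only one of three correct options and adds one distractor.
example = [
    ({"A", "C"}, {"A", "C"}),
    ({"A", "B"}, {"A", "C", "D"}),
]
mean_f1, exact_match = benchmark_scores(example)
print(f"F1 = {mean_f1:.2f}, Exact Match = {exact_match:.2f}")
```

Under these assumptions a partially correct answer still earns F1 credit but counts as a miss for Exact Match, which is one way a model could reach a high F1 while its Exact Match stays much lower.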

