2602.16747v1 Feb 18, 2026 cs.LG

LiveClin: A Live Clinical Benchmark That Reflects Real-World Clinical Practice Without Data Leakage

LiveClin: A Live Clinical Benchmark without Leakage

Xidong Wang (Citations: 1,064; h-index: 11)

Jinjie Gu (Citations: 476; h-index: 12)

Benyou Wang (Citations: 3; h-index: 1)

Yue Shen (Citations: 240; h-index: 5)

Shuqi Guo (Citations: 15; h-index: 3)

Junying Chen (Citations: 8; h-index: 2)

Lei Liu (Citations: 258; h-index: 4)

Ping Zhang (Citations: 71; h-index: 5)

Jian Wang (Citations: 216; h-index: 5)

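The reliability of medical LLM evaluation is seriously undermined by data contamination and knowledge obsolescence, which lead to inflated scores on static benchmarks. To address these problems, we introduce LiveClin, a live benchmark designed to reflect real-world clinical practice more accurately. LiveClin is built from recent peer-reviewed case reports and is updated twice a year, which keeps it clinically current and guards against data contamination. Through a verified AI-human collaborative workflow involving 239 physicians, real patient cases are transformed into complex, multimodal evaluation scenarios that cover the entire clinical pathway. The benchmark currently consists of 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin highlights how difficult real clinical scenarios are: even the best-performing model reaches a Case Accuracy of only 35.7%. In a comparison with human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, and both groups outperformed most models. LiveClin provides a continuously evolving, clinically grounded framework that helps guide the development of medical LLMs, improve their reliability, and increase their real-world utility. Our data and code are publicly available at https://github.com/AQ-MedAI/LiveClin.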

Original Abstract

The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, with both surpassing most models. LiveClin thus provides a continuously evolving, clinically grounded framework to guide the development of medical LLMs towards closing this gap and achieving greater reliability and real-world utility. Our data and code are publicly available at https://github.com/AQ-MedAI/LiveClin.

2 Citations

0 Influential

32.931471805599 Altmetric

166.7 Score
