2605.01417v1 May 02, 2026 cs.CL

Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

Robert Scholz (Citations: 30; h-index: 3)

Anas Zafar (Citations: 1,099; h-index: 6)

Paul S. Scotti (Citations: 156; h-index: 4)

Jean-Benoit Delbrouck (Citations: 2,821; h-index: 22)

Nishant Mishra (Citations: 22; h-index: 2)

Hunar Batra (Citations: 115; h-index: 4)

Ronald Clark (Citations: 27; h-index: 2)

Benjamin Warner (Citations: 734; h-index: 3)

Ratna Sagari Grandhi (Citations: 0; h-index: 0)

Aymane Ouraq (Citations: 0; h-index: 0)

S. Panigrahi (Citations: 23; h-index: 3)

Geetu Ambwani (Citations: 55; h-index: 4)

Kunal Bagga (Citations: 1; h-index: 1)

Nikhil Khandekar (Citations: 71; h-index: 2)

Arya Hariharan (Citations: 2; h-index: 1)

M. Ram (Citations: 0; h-index: 0)

Shamus Sim Zi Yang (Citations: 0; h-index: 0)

Ahmed Essouaied (Citations: 9; h-index: 1)

Adepoju Jeremiah Moyondafoluwa (Citations: 0; h-index: 0)

Bofeng Huang (Citations: 0; h-index: 0)

M. Beavers (Citations: 0; h-index: 0)

Srishti Gureja (Citations: 81; h-index: 2)

Anish Mahishi (Citations: 5; h-index: 1)

Sameed Khan (Citations: 6; h-index: 2)

Maxime Griot (Citations: 193; h-index: 4)

Siddhant Bharadwaj (Citations: 5; h-index: 1)

A. Vashist (Citations: 4; h-index: 1)

L. Murali (Citations: 46; h-index: 4)

Harsh Deshpande (Citations: 12; h-index: 2)

Ameen Patel (Citations: 20; h-index: 3)

William Brown (Citations: 94; h-index: 3)

Johannes Hagemann (Citations: 105; h-index: 6)

Connor Lane (Citations: 8; h-index: 2)

Tanishq Mathew Abraham (Citations: 631; h-index: 5)

Maxime Kieffer (Citations: 114; h-index: 3)

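Evaluating large language models (LLMs) for medical applications remains difficult because of benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing benchmarks are often saturated, depend heavily on restricted datasets, or fail to evaluate models comprehensively. This work introduces Medmarks, a fully open-source evaluation suite of 30 benchmarks covering question answering, information extraction, medical calculations, and open-ended clinical reasoning. The authors systematically evaluate 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. The results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the highest performance, that most frontier proprietary models are significantly more token efficient than open-weight alternatives, that medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of the evaluations (Medmarks-T) can be used directly as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks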

Original Abstract

Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, & GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks
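The abstract reports that models are susceptible to answer-order bias. As a rough illustration of how such sensitivity could be probed on multiple-choice items, the sketch below compares accuracy on the original option order with mean accuracy over random permutations of the options. This is not the Medmarks implementation: the `Item` layout, the function names, and the two toy "models" are hypothetical and exist only to make the example runnable.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical item layout: (question, options, index of the correct option).
Item = Tuple[str, List[str], int]
# Hypothetical model interface: takes a question and options, returns a chosen index.
Model = Callable[[str, Sequence[str]], int]


def shuffle_options(options: Sequence[str], answer_idx: int,
                    rng: random.Random) -> Tuple[List[str], int]:
    """Return a permuted copy of the options and the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)


def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of items for which the model picks the correct index."""
    correct = sum(model(q, opts) == ans for q, opts, ans in items)
    return correct / len(items)


def order_sensitivity(model: Model, items: Sequence[Item],
                      n_shuffles: int = 20, seed: int = 0) -> Tuple[float, float]:
    """Return (accuracy on original order, mean accuracy over shuffled orders)."""
    rng = random.Random(seed)
    base = accuracy(model, items)
    shuffled_accs = []
    for _ in range(n_shuffles):
        permuted = []
        for q, opts, ans in items:
            new_opts, new_ans = shuffle_options(opts, ans, rng)
            permuted.append((q, new_opts, new_ans))
        shuffled_accs.append(accuracy(model, permuted))
    return base, sum(shuffled_accs) / n_shuffles


# Toy data: the correct option is always listed first.
ITEMS: List[Item] = [
    ("Which vitamin deficiency causes scurvy?",
     ["Vitamin C", "Vitamin D", "Vitamin K", "Vitamin A"], 0),
    ("Which organ produces insulin?",
     ["Pancreas", "Liver", "Kidney", "Spleen"], 0),
    ("Which blood cells carry oxygen?",
     ["Red blood cells", "Platelets", "Neutrophils", "Lymphocytes"], 0),
]
ANSWER_TEXT = {q: opts[ans] for q, opts, ans in ITEMS}


def position_biased_model(question: str, options: Sequence[str]) -> int:
    """Toy model with extreme position bias: always picks the first option."""
    return 0


def content_model(question: str, options: Sequence[str]) -> int:
    """Toy order-invariant model: locates the known answer text."""
    return list(options).index(ANSWER_TEXT[question])


for name, model in [("position-biased", position_biased_model),
                    ("content-based", content_model)]:
    base, shuffled = order_sensitivity(model, ITEMS)
    print(f"{name}: original={base:.2f}, shuffled mean={shuffled:.2f}")
```

A large gap between the two numbers suggests the model is keying on option position rather than content. A real evaluation would replace the toy models with calls to an actual LLM and would likely also track per-item answer consistency across permutations.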

