2603.01343v1 Mar 02, 2026 cs.CL

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yiming Zhao (Citations: 60; h-index: 5)
Sheela R. Damle (Citations: 332; h-index: 9)
Simone E Dekker (Citations: 3; h-index: 1)
Scott Geng (Citations: 176; h-index: 3)
Jesse J. Hubbard (Citations: 676; h-index: 8)
Fatima Zelada-Arenas (Citations: 20; h-index: 2)
Brianne Flores (Citations: 0; h-index: 0)
S. Salerno (Citations: 1,324; h-index: 16)
Carrie Wright (Citations: 16; h-index: 2)
Zihao Wang (Citations: 320; h-index: 2)
Pang Wei W. Koh (Citations: 57; h-index: 2)
Jeff Leek (Citations: 75; h-index: 3)
Karly Williams Silva (Citations: 0; h-index: 0)
M. F. Fernández (Citations: 43; h-index: 2)
A. Alvarez (Citations: 29; h-index: 2)
Alexis Rodriguez (Citations: 1; h-index: 1)

Original Abstract

Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
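The two headline metrics in the abstract can be illustrated with a small sketch. The hallucination rate follows the abstract's definition (the percentage of responses containing at least one factual error). The completeness score here is only an assumption: the abstract does not state how criteria are weighted or aggregated, so this sketch uses an unweighted mean of the per-response fraction of criteria met. The `JudgedResponse` record and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """One model response after judging (hypothetical record layout)."""
    criteria_met: int     # rubric criteria the judge marked as satisfied
    criteria_total: int   # rubric criteria attached to the question
    factual_errors: int   # factual errors the judge flagged in the response


def hallucination_rate(responses: list[JudgedResponse]) -> float:
    """Percentage of responses containing at least one factual error."""
    if not responses:
        raise ValueError("need at least one response")
    flagged = sum(1 for r in responses if r.factual_errors >= 1)
    return 100.0 * flagged / len(responses)


def mean_completeness(responses: list[JudgedResponse]) -> float:
    """Mean fraction of rubric criteria met, as a percentage.

    ASSUMPTION: unweighted criteria, averaged per response. The actual
    aggregation used by the benchmark may differ.
    """
    if not responses:
        raise ValueError("need at least one response")
    fractions = [r.criteria_met / r.criteria_total for r in responses]
    return 100.0 * sum(fractions) / len(fractions)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    toy = [
        JudgedResponse(criteria_met=8, criteria_total=10, factual_errors=0),
        JudgedResponse(criteria_met=5, criteria_total=10, factual_errors=2),
        JudgedResponse(criteria_met=9, criteria_total=12, factual_errors=0),
        JudgedResponse(criteria_met=6, criteria_total=12, factual_errors=1),
    ]
    print(f"hallucination rate: {hallucination_rate(toy):.1f}%")
    print(f"mean completeness:  {mean_completeness(toy):.2f}%")
```

Keeping the two metrics as separate functions mirrors the abstract's point that they can diverge: a response can satisfy many rubric criteria and still contain a factual error, so a high completeness score does not imply a low hallucination rate.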
