2601.05500v1 Jan 09, 2026 cs.AI

The Evaluation Gap in Medicine, AI and LLMs: Navigating Elusive Ground Truth & Uncertainty via a Probabilistic Paradigm

Aparna Elangovan (Citations: 291, h-index: 7)
Lei Xu (Citations: 53, h-index: 2)
Mahsa Elyasi (Citations: 61, h-index: 3)
İ. Akdulum (Citations: 105, h-index: 5)
Mehmet Aksakal (Citations: 78, h-index: 6)
Enes Gurun (Citations: 54, h-index: 4)
Saab Mansour (Citations: 0, h-index: 0)
Ravid Shwartz Ziv (Citations: 0, h-index: 0)
Dan Roth (Citations: 51, h-index: 2)
Brian Hur (Citations: 114, h-index: 1)
Karin Verspoor (Citations: 2, h-index: 1)

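When benchmarking the relative capabilities of AI systems, including large language models (LLMs) and vision models, the impact of uncertainty inherent in expert-provided ground truth is generally overlooked. This ambiguity has especially serious consequences in medicine, where uncertainty is pervasive. This paper introduces a probabilistic paradigm to explain theoretically that even an expert needs high certainty in the ground truth to achieve a high score, whereas on datasets with high variation in the ground truth there may be little difference between a random labeller and an expert. Ignoring uncertainty in the ground truth of evaluation data can therefore lead to the mistaken conclusion that a non-expert performs similarly to an expert. Using the probabilistic paradigm, we present the concepts of "expected accuracy" and "expected F1" to estimate the score a human expert or system can achieve given the variability of the ground truth. This work proposes that, when evaluating a system's capability, results should be stratified by the probability of the ground truth answer, typically measured by the agreement rate among the experts who produced the ground truth. Such stratification is especially important when overall performance falls below an 80% threshold. With stratified evaluation, performance comparisons in high-certainty bins become more reliable, and the effect of uncertainty, a key confounding factor, is mitigated.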

Original Abstract

Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is particularly consequential in medicine where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to theoretically explain how high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas in datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Therefore, ignoring uncertainty in ground truth evaluation data can result in the misleading conclusion that a non-expert has similar performance to that of an expert. Using the probabilistic paradigm, we thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability. Our work leads to the recommendation that when establishing the capability of a system, results should be stratified by probability of the ground truth answer, typically measured by the agreement rate of ground truth experts. Stratification becomes critical when the overall performance drops below a threshold of 80%. Under stratified evaluation, performance comparison becomes more reliable in high certainty bins, mitigating the effect of the key confounding factor -- uncertainty.
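The intuition in the abstract can be illustrated with a small Python sketch. It is not the paper's formulation: the abstract does not give the definitions of expected accuracy or expected F1, so the code assumes a simplified model in which each item's ground-truth label is drawn with probability equal to the expert agreement rate on the modal answer, an "expert" always gives the modal answer, and a random labeller picks uniformly among the classes. The function names, the bin edges, and the toy data are all hypothetical.

```python
from collections import defaultdict


def expected_accuracy(agreement_rates):
    """Expected accuracy of a labeller who always gives the modal answer.

    Assumption: each item's ground truth equals the modal answer with
    probability equal to its agreement rate.
    """
    return sum(agreement_rates) / len(agreement_rates)


def random_expected_accuracy(num_classes):
    """Expected accuracy of a uniform random labeller.

    A uniform guess matches a sampled label with probability 1/k,
    whatever the label distribution is.
    """
    return 1.0 / num_classes


def stratify_by_agreement(agreement_rates, is_correct, edges=(0.0, 0.6, 0.8, 1.0)):
    """Accuracy of a system per agreement bin (lo, hi]; bin edges are illustrative."""
    buckets = defaultdict(list)
    for rate, ok in zip(agreement_rates, is_correct):
        for lo, hi in zip(edges, edges[1:]):
            if lo < rate <= hi:
                buckets[(lo, hi)].append(ok)
                break
    return {bin_: sum(hits) / len(hits) for bin_, hits in buckets.items()}


if __name__ == "__main__":
    # Toy binary task: two unanimous items and two items with a 50/50 expert split.
    rates = [1.0, 1.0, 0.5, 0.5]
    system_correct = [True, True, True, False]

    print("expert, expected accuracy:", expected_accuracy(rates))
    print("random labeller, expected accuracy:", random_expected_accuracy(2))
    print("system accuracy by agreement bin:", stratify_by_agreement(rates, system_correct))
```

In this toy setting, on the two items where experts split evenly the modal-answer expert is expected to score no better than the random labeller, which is the confound that stratifying by agreement rate is meant to expose.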

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
