2605.04039v1 May 05, 2026 cs.CL

Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind

Citations: 16

h-index: 2

Jeta Sopa

Citations: 12

h-index: 1

Gerhard Wellein

Citations: 11

h-index: 1

Andreas K. Maier

Citations: 39

h-index: 3

Soroosh Tayebi Arasteh

Citations: 1,026

h-index: 12

S. Nebelung

Citations: 2,846

h-index: 27

D. Truhn

Citations: 8,092

h-index: 48

Mahshad Lotfinia

Citations: 231

h-index: 9

Tri-Thien Nguyen

Citations: 13

h-index: 1

H. Kostler

Citations: 5

h-index: 1

Sebastian Bickelhaupt

Citations: 1

h-index: 1

M. Uder

Citations: 25

h-index: 3

Original Abstract

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
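
The abstract reports accuracy alongside three safety rates (high-risk error, evidence contradiction, dangerous overconfidence) derived from option-level labels. The following is a minimal sketch of how such a profile could be tallied, not the authors' implementation. The names `Option`, `Response`, `safety_profile`, and `relative_reduction` are hypothetical, as are the self-reported `confidence` field and the 0.8 threshold used to flag "dangerous overconfidence"; the abstract does not define that metric precisely, so it is assumed here to mean a high-risk answer given with high confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """Option-level labels for one answer choice (field names are assumptions)."""
    correct: bool = False
    high_risk: bool = False             # choosing this option is a high-risk error
    contradicts_evidence: bool = False  # choosing this option contradicts the evidence


@dataclass(frozen=True)
class Response:
    """One model answer: the chosen option plus a hypothetical confidence in [0, 1]."""
    chosen: Option
    confidence: float


def safety_profile(responses, confidence_threshold=0.8):
    """Return accuracy and safety rates as fractions of all responses.

    ASSUMPTION: dangerous overconfidence = high-risk answer with
    confidence >= confidence_threshold. The paper may define it differently.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")

    def rate(predicate):
        return sum(1 for r in responses if predicate(r)) / n

    return {
        "accuracy": rate(lambda r: r.chosen.correct),
        "high_risk_error": rate(lambda r: r.chosen.high_risk),
        "contradiction": rate(lambda r: r.chosen.contradicts_evidence),
        "dangerous_overconfidence": rate(
            lambda r: r.chosen.high_risk and r.confidence >= confidence_threshold
        ),
    }


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`, e.g. 0.5 means the rate halved."""
    if before == 0:
        raise ValueError("relative reduction is undefined when `before` is 0")
    return (before - after) / before


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy = [
        Response(Option(correct=True), 0.90),
        Response(Option(high_risk=True), 0.95),
        Response(Option(high_risk=True), 0.50),
        Response(Option(contradicts_evidence=True), 0.70),
    ]
    print(safety_profile(toy))

    # Percentages quoted in the abstract: closed-book vs. clean evidence.
    for name, before, after in [
        ("high-risk error", 12.0, 2.6),
        ("contradiction", 12.7, 2.3),
        ("dangerous overconfidence", 8.0, 1.6),
    ]:
        print(f"{name}: {relative_reduction(before, after):.1%} relative reduction")
```

Keeping the safety rates separate from accuracy mirrors the abstract's main point: a condition can raise accuracy while leaving high-risk error or dangerous overconfidence elevated, so the two should be reported side by side rather than collapsed into one score.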

