2601.06636v1 Jan 10, 2026 cs.CL

MedEinst: Benchmarking the Einstellung Effect in Medical LLMs through Counterfactual Differential Diagnosis

Wenxuan Wang (Citations: 186, h-index: 5)

Wenting Chen (Citations: 16, h-index: 2)

Zhongrui Zhu (Citations: 5, h-index: 1)

Guolin Huang (Citations: 63, h-index: 4)

Abstract

Despite achieving high accuracy on medical benchmarks, LLMs exhibit the Einstellung effect in clinical diagnosis: they rely on statistical shortcuts rather than patient-specific evidence, which causes misdiagnosis in atypical cases. Existing benchmarks fail to detect this critical failure mode. We introduce MedEinst, a counterfactual benchmark with 5,383 paired clinical cases across 49 diseases. Each pair contains a control case and a "trap" case with altered discriminative evidence that flips the diagnosis. We measure susceptibility via the Bias Trap Rate, the probability of misdiagnosing a trap case despite correctly diagnosing its control. An extensive evaluation of 17 LLMs shows that frontier models achieve high baseline accuracy but suffer severe bias trap rates. We therefore propose ECR-Agent, which aligns LLM reasoning with Evidence-Based Medicine standards via two components: (1) Dynamic Causal Inference (DCI) performs structured reasoning through dual-pathway perception, dynamic causal graph reasoning across three levels (association, intervention, counterfactual), and an evidence audit for the final diagnosis; (2) Critic-Driven Graph and Memory Evolution (CGME) iteratively refines the system by storing validated reasoning paths in an exemplar base and consolidating disease-specific knowledge into evolving illness graphs. The source code will be released.
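The Bias Trap Rate described in the abstract can be read as a conditional probability: among pairs whose control case the model diagnoses correctly, the fraction whose trap case it gets wrong. The sketch below shows that reading only. The abstract does not give the exact formula, so the `PairResult` schema, the function names, and the choice of denominator (pairs with a correct control) are all assumptions made for illustration, and the toy data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome of one control/trap pair for a single model (hypothetical schema)."""
    control_correct: bool  # model diagnosed the control case correctly
    trap_correct: bool     # model diagnosed the trap case correctly


def baseline_accuracy(results: list[PairResult]) -> float:
    """Fraction of control cases diagnosed correctly."""
    if not results:
        raise ValueError("no results")
    return sum(r.control_correct for r in results) / len(results)


def bias_trap_rate(results: list[PairResult]) -> float:
    """Assumed reading: P(trap misdiagnosed | control diagnosed correctly)."""
    # Only pairs with a correct control case count toward the denominator.
    eligible = [r for r in results if r.control_correct]
    if not eligible:
        raise ValueError("no pair has a correctly diagnosed control case")
    trapped = sum(not r.trap_correct for r in eligible)
    return trapped / len(eligible)


# Toy data, invented purely for illustration (not results from the paper).
toy = [
    PairResult(control_correct=True, trap_correct=True),
    PairResult(control_correct=True, trap_correct=False),
    PairResult(control_correct=True, trap_correct=False),
    PairResult(control_correct=False, trap_correct=False),
]
print(baseline_accuracy(toy))  # 3 of 4 controls correct
print(bias_trap_rate(toy))     # 2 of 3 eligible pairs trapped
```

Under this reading, a model can score well on the control cases and still have a high Bias Trap Rate, which is the pattern the abstract reports for frontier models. Conditioning on a correct control separates shortcut-driven errors from cases the model could not diagnose in the first place.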
