2602.10359v2 Feb 10, 2026 eess.IV

Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

Jineel H Raythatha (citations: 57, h-index: 4)
Shuchang Ye (citations: 33, h-index: 3)
Jeremy Hsu (citations: 9, h-index: 2)
Jinman Kim (citations: 2, h-index: 1)

Abstract

Purpose: Translating foundation models into clinical practice requires evaluating their performance under compound distribution shift, where severe class imbalance coexists with heterogeneous imaging appearances. This challenge is relevant for traumatic bowel injury, a rare but high-mortality diagnosis. We investigated whether specificity deficits in foundation models are associated with heterogeneity in the negative class.

Methods: This retrospective study used the multi-institutional RSNA Abdominal Traumatic Injury CT dataset (2019-2023), comprising scans from 23 centres. Two foundation models (MedCLIP, zero-shot; RadDINO, linear probe) were compared against three task-specific approaches (CNN, Transformer, Ensemble). Models were trained on 3,147 patients (2.3% bowel injury prevalence) and evaluated on an enriched 100-patient test set. To isolate negative-class effects, specificity was assessed in patients without bowel injury who had concurrent solid organ injury (n=58) versus no abdominal pathology (n=50).

Results: Foundation models achieved equivalent discrimination to task-specific models (AUC, 0.64-0.68 versus 0.58-0.64) with higher sensitivity (79-91% vs 41-74%) but lower specificity (33-50% vs 50-88%). All models demonstrated high specificity in patients without abdominal pathology (84-100%). When solid organ injuries were present, specificity declined substantially for foundation models (50-51 percentage points) compared with smaller reductions of 12-41 percentage points for task-specific models.

Conclusion: Foundation models matched task-specific discrimination without task-specific training, but their specificity deficits were driven primarily by confounding negative-class heterogeneity rather than prevalence alone. Susceptibility to negative-class heterogeneity decreased progressively with labelled training, suggesting adaptation is required before clinical implementation.
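The stratified specificity analysis described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the `Case` record, its field names, the function names, and the toy cases are all hypothetical, and the sketch simplifies the "no abdominal pathology" group to "no solid organ injury flag". It only shows how specificity can be computed separately within two subgroups of the negative class and compared in percentage points.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical record layout, not taken from the paper or the dataset.
    bowel_injury: bool         # ground-truth label (positive class)
    solid_organ_injury: bool   # confounding pathology among negatives
    predicted_positive: bool   # model decision at some fixed threshold


def specificity(cases):
    """True-negative rate over cases without bowel injury (None if no negatives)."""
    negatives = [c for c in cases if not c.bowel_injury]
    if not negatives:
        return None
    true_negatives = sum(1 for c in negatives if not c.predicted_positive)
    return true_negatives / len(negatives)


def stratified_specificity(cases):
    """Specificity within negatives with and without concurrent solid organ injury."""
    with_soi = [c for c in cases if c.solid_organ_injury]
    without_soi = [c for c in cases if not c.solid_organ_injury]
    return {
        "solid_organ_injury": specificity(with_soi),
        "no_solid_organ_injury": specificity(without_soi),
    }


# Toy cases for illustration only; these are NOT the study's data.
toy_cases = [
    # Negatives with solid organ injury: two false alarms out of four.
    Case(False, True, True),
    Case(False, True, True),
    Case(False, True, False),
    Case(False, True, False),
    # Negatives without solid organ injury: no false alarms.
    Case(False, False, False),
    Case(False, False, False),
    Case(False, False, False),
    Case(False, False, False),
    # One bowel-injury case, ignored by the specificity calculation.
    Case(True, True, True),
]

result = stratified_specificity(toy_cases)
# Decline in specificity between the two negative subgroups, in percentage points.
drop_pp = 100 * (result["no_solid_organ_injury"] - result["solid_organ_injury"])
print(result, drop_pp)
```

Reporting the gap in percentage points, as the abstract does, keeps the comparison between subgroups independent of how common bowel injury is in the test set, because specificity is computed over negatives only.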
