2604.08333v1 Apr 09, 2026 cs.CV

Reality Amid Inflated Expectations: An Analysis of the Image Classification Performance Degradation of Medical Multimodal Large Language Models

Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

Kaili Zheng

Citations: 26

h-index: 2

Fanbin Mo

Citations: 27

h-index: 2

Miao Li

Citations: 77

h-index: 5

Xun Zhu

Citations: 5

h-index: 1

Shaoshu Yang

Citations: 818

h-index: 5

Yiming Shi

Citations: 42

h-index: 4

Jianbao Gao

Citations: 47

h-index: 4

Ji Wu

Citations: 74

h-index: 5

Xi Chen

Citations: 987

h-index: 4

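Advances in multimodal large language models (MLLMs) have brought transformative changes to medical image analysis. Yet medical image classification, one of the earliest and most fundamental tasks in this paradigm, exposes a sobering reality: state-of-the-art medical MLLMs consistently underperform conventional deep learning models despite their overwhelming advantages in pre-training data and model parameters. This paradox raises a fundamental question: where exactly does the performance degradation originate? In this study, we conducted extensive experiments on 14 open-source medical MLLMs using three representative image classification datasets. Going beyond simple performance comparison, we used feature probing to trace how visual features flow module by module and layer by layer through the entire MLLM pipeline. This allowed us to visualize clearly where and how classification signals are distorted, diluted, or ignored. As the first attempt to analyze classification performance degradation in medical MLLMs, this study identifies four failure modes: 1) limited quality of visual representations, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. We also introduce quantitative scores that characterize the healthiness of feature evolution, enabling systematic comparison across MLLMs and datasets. In addition, we discuss in depth the key barriers that keep current medical MLLMs from reaching their full potential. We hope this work offers a new perspective on medical MLLM development and underscores that the path from high expectations to clinically deployable models remains difficult.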

Original Abstract

The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
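
The feature probing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the probe itself, so this example assumes a simple linear probe fitted by closed-form ridge regression on frozen features and scored on a held-out split. The stage names (`vision_encoder`, `connector`, `llm_layer`), the helper functions, and the synthetic features are all hypothetical and are not taken from the paper; the added noise only mimics a stage at which class information degrades.

```python
import numpy as np


def probe_accuracy(features, labels, num_classes, ridge=1e-3, train_frac=0.8, seed=0):
    """Fit a linear probe on frozen features and return held-out accuracy.

    Assumption: a ridge-regression probe on one-hot targets stands in for
    whatever probe the paper actually uses.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    split = int(train_frac * len(labels))
    train, test = order[:split], order[split:]

    # Append a constant column so the probe learns a bias term.
    x = np.hstack([features, np.ones((len(features), 1))])
    y = np.eye(num_classes)[labels]  # one-hot targets

    # Closed-form ridge solution: W = (X^T X + ridge * I)^-1 X^T Y
    gram = x[train].T @ x[train] + ridge * np.eye(x.shape[1])
    weights = np.linalg.solve(gram, x[train].T @ y[train])

    predictions = (x[test] @ weights).argmax(axis=1)
    return float((predictions == labels[test]).mean())


def probe_pipeline(stage_features, labels, num_classes):
    """Probe every stage with the same labels so the accuracies are comparable."""
    return {
        name: probe_accuracy(feats, labels, num_classes)
        for name, feats in stage_features.items()
    }


def make_synthetic_stages(n=600, dim=16, num_classes=3, seed=0):
    """Build toy features for three hypothetical pipeline stages."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=n)
    centers = rng.normal(size=(num_classes, dim)) * 3.0
    clean = centers[labels] + rng.normal(size=(n, dim))
    stages = {
        # Well-separated class clusters.
        "vision_encoder": clean,
        # Same clusters with heavy added noise, mimicking fidelity loss.
        "connector": clean + rng.normal(size=(n, dim)) * 8.0,
        # Features unrelated to the labels, mimicking a lost signal.
        "llm_layer": rng.normal(size=(n, dim)),
    }
    return stages, labels


if __name__ == "__main__":
    stages, labels = make_synthetic_stages()
    for name, acc in probe_pipeline(stages, labels, num_classes=3).items():
        print(f"{name}: held-out probe accuracy = {acc:.3f}")
```

In a real analysis, each entry of `stage_features` would hold pooled hidden states extracted from one module or layer of the MLLM. A drop in probe accuracy between consecutive stages would then suggest where the classification signal is being lost.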

