2604.10233v1 Apr 11, 2026 cs.CV

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

Dunyuan Xu (citations: 21, h-index: 3)
Yaoqian Li (citations: 45, h-index: 4)
Xiaomeng Li (citations: 47, h-index: 3)
Jinpeng Li (citations: 151, h-index: 7)
Pheng-Ann Heng (citations: 396, h-index: 11)
Yang Yu (citations: 13, h-index: 2)

Original Abstract

3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. They therefore have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from insufficiently pretrained vision encoders and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained on 2D natural images, to support 3D medical volumetric inputs while reusing all of its pretrained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.
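
The abstract does not say how the TGH-MoE gate, experts, or hierarchy are built, so the following is only a generic illustration of text-guided expert routing, not the authors' method. It is a minimal sketch that assumes a single (non-hierarchical) layer of linear experts and a gate that sees only a text-prompt embedding. Every name, dimension, and embedding value in it (`TextGuidedMoE`, `report_prompt`, `vqa_prompt`, and so on) is hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class TextGuidedMoE:
    """Toy mixture-of-experts layer whose gate is driven by the text prompt.

    Assumption: experts are plain linear maps and the layer is flat; the
    hierarchy in the paper's TGH-MoE is not modelled here.
    """

    def __init__(self, feat_dim, text_dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [random_matrix(feat_dim, feat_dim, rng)
                        for _ in range(num_experts)]
        self.gate = random_matrix(num_experts, text_dim, rng)

    def route(self, text_emb):
        """Turn a text embedding into one mixing weight per expert."""
        return softmax(matvec(self.gate, text_emb))

    def forward(self, image_feat, text_emb):
        """Mix the expert outputs for image_feat using text-derived weights."""
        weights = self.route(text_emb)
        mixed = [0.0] * len(image_feat)
        for weight, expert in zip(weights, self.experts):
            expert_out = matvec(expert, image_feat)
            mixed = [m + weight * v for m, v in zip(mixed, expert_out)]
        return mixed, weights


layer = TextGuidedMoE(feat_dim=4, text_dim=3, num_experts=2)
image_feat = [0.5, -0.2, 0.1, 0.9]
report_prompt = [1.0, 0.0, 0.0]  # hypothetical embedding of an MRG prompt
vqa_prompt = [0.0, 1.0, 0.0]     # hypothetical embedding of an MVQA prompt
report_feat, report_weights = layer.forward(image_feat, report_prompt)
vqa_feat, vqa_weights = layer.forward(image_feat, vqa_prompt)
```

Because the gate depends only on the prompt, the same image feature can be mixed differently for a report-generation prompt and a question-answering prompt. That is the general idea behind letting text distinguish tasks; a trained gate and real encoders would replace the random weights used here.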
