2605.14710v1 May 14, 2026 cs.CV

Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

Ting Xiao, Liren Chen, Lidong Sun, Min Huang, Jun Tang, Ying Zhu, Guanjie Wang, Yiqing Xia

Abstract

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion and lack a framework that effectively integrates all three of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving the robustness of multi-modal fusion. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic, deep fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.
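
The abstract does not spell out the form of the dual semantic alignment loss. As a rough illustration only, the sketch below assumes it resembles a symmetric contrastive (InfoNCE-style) objective that pulls each image embedding toward its paired text embedding in both directions. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_logits(a_batch, b_batch, temperature):
    # Pairwise cosine similarities between two batches, divided by temperature.
    a = [l2_normalize(v) for v in a_batch]
    b = [l2_normalize(v) for v in b_batch]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th image is paired with the i-th text.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_alignment_loss(vision, text, temperature=0.07):
    # Assumed form: average of the vision-to-text and text-to-vision losses.
    logits = cosine_logits(vision, text, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy example: correctly paired embeddings should score a lower loss
# than the same embeddings with the pairing swapped.
vision = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]
print(symmetric_alignment_loss(vision, text_matched) < symmetric_alignment_loss(vision, text_swapped))
```

In a real model the embeddings would come from the image and text encoders, and this term would be combined with the prognosis classification loss. How the paper weights or conditions these terms is not stated in the abstract.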
