2603.22179v1 Mar 23, 2026 cs.AI

MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

Ehsan Adeli (citations: 6,516; h-index: 9)

J. W. O'Sullivan (citations: 5; h-index: 1)

A. Chaudhari (citations: 84; h-index: 2)

Tahoura Nedaee (citations: 5; h-index: 1)

Francois Haddad (citations: 6; h-index: 2)

Michael Salerno (citations: 1; h-index: 1)

R. Arnaout (citations: 5; h-index: 1)

Euan A Ashley (citations: 164; h-index: 5)

Mohammad Asadi (citations: 28; h-index: 3)

Lennart Elbe (citations: 0; h-index: 0)

Li Fe-Fei (citations: 0; h-index: 0)

Original Abstract

Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
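The hierarchical arrangement described in the abstract, with one expert per modality coordinated by an orchestrator, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `Study`, `make_stub_expert`, and `Orchestrator` are hypothetical, the experts are stubs rather than vision-language models, and the routing logic is not taken from the MARCUS release.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical container for one cardiac test; not part of the MARCUS codebase.
@dataclass
class Study:
    modality: str  # assumed labels: "ecg", "echo", or "cmr"
    data: object   # placeholder for the signal or image payload


# An expert maps (study, question) to a free-text report.
Expert = Callable[[Study, str], str]


def make_stub_expert(name: str) -> Expert:
    """Return a stand-in expert that only echoes the question it received."""
    def expert(study: Study, question: str) -> str:
        return f"[{name}] finding for: {question}"
    return expert


class Orchestrator:
    """Routes each study to the expert registered for its modality."""

    def __init__(self, experts: Dict[str, Expert]) -> None:
        self.experts = experts

    def answer(self, studies: List[Study], question: str) -> str:
        reports = []
        for study in studies:
            expert = self.experts.get(study.modality)
            if expert is None:
                raise ValueError(f"no expert for modality {study.modality!r}")
            reports.append(expert(study, question))
        # A real orchestrator would reason over the reports; this sketch joins them.
        return "\n".join(reports)


experts = {m: make_stub_expert(m) for m in ("ecg", "echo", "cmr")}
orchestrator = Orchestrator(experts)
result = orchestrator.answer(
    [Study("ecg", None), Study("cmr", None)],
    "Is there evidence of prior infarction?",
)
print(result)
```

The sketch only shows the routing step. How the actual orchestrator combines expert outputs into a single multimodal answer is not specified in the abstract and is left out here.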
