2603.09909v1 Mar 10, 2026 cs.AI

MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems

Jiang-She Zhang (Citations: 179, h-index: 8)

Yu-feng Qian (Citations: 16, h-index: 2)

Xiaobin Hu (Citations: 13, h-index: 2)

Siyang Xin (Citations: 8, h-index: 2)

Xiaokun Chen (Citations: 244, h-index: 5)

Peng-Tao Jiang (Citations: 35, h-index: 3)

Jiawei Liu (Citations: 87, h-index: 2)

Hongwei Li (Citations: 14, h-index: 2)

Jiaquan Yu (Citations: 3, h-index: 1)

Original Abstract

While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) a standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities; (2) an automated clinical reasoning evaluator, built on a zero-shot semantic evaluation paradigm that overcomes the limitations of lexical string matching by using large vision-language models to verify diagnostic logic and visual grounding; and (3) the most extensive benchmark to date, spanning 11 organ systems and 473 diseases and standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data are publicly available at: https://github.com/NUS-Project/MedMASLab/
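The contrast the abstract draws between lexical string matching and semantic evaluation can be illustrated with a small sketch. Everything below is a hypothetical illustration and not the MedMASLab API: the `AgentMessage` envelope, the `Judge` callable, and `toy_judge` (a hard-coded stand-in for a call to a vision-language model) are all assumed names chosen for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentMessage:
    """Hypothetical envelope for one agent turn (text plus optional images)."""
    sender: str
    text: str
    image_paths: List[str] = field(default_factory=list)


def normalize(s: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def lexical_match(prediction: str, reference: str) -> bool:
    """String-matching baseline: correct only if the normalized strings are equal."""
    return normalize(prediction) == normalize(reference)


# Assumed judge interface: takes a prompt and image paths, returns free text.
Judge = Callable[[str, List[str]], str]


def semantic_match(msg: AgentMessage, reference: str, judge: Judge) -> bool:
    """Ask a judge model whether the answer and the reference agree in meaning."""
    prompt = (
        "Reference diagnosis: " + reference + "\n"
        "Model answer: " + msg.text + "\n"
        "Do these describe the same diagnosis? Answer yes or no."
    )
    verdict = judge(prompt, msg.image_paths)
    return verdict.strip().lower().startswith("yes")


def toy_judge(prompt: str, image_paths: List[str]) -> str:
    """Stand-in for a vision-language model; knows a single synonym pair."""
    p = prompt.lower()
    return "yes" if "heart attack" in p and "myocardial infarction" in p else "no"


msg = AgentMessage(sender="cardiologist", text="Myocardial infarction",
                   image_paths=["ecg_001.png"])
lexical_result = lexical_match(msg.text, "heart attack")
semantic_result = semantic_match(msg, "heart attack", toy_judge)
```

With the toy judge, the synonymous answer is rejected by the lexical baseline and accepted by the semantic check. In a real setup the judge would be a model call that also receives the images, which is where the visual grounding check described in the abstract would come in.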
