2601.06847v1 Jan 11, 2026 cs.CV

MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data

Authors:

Hao Luo (citations: 239, h-index: 8)
Fan Wang (citations: 1, h-index: 1)
Mengmeng Zhang (citations: 5, h-index: 1)
Xiaoping Wu (citations: 18, h-index: 2)
Yisheng Lv (citations: 9, h-index: 2)

Abstract

Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently achieve improved referring grounding performance, enhance multi-object semantic disambiguation, and exhibit strong generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchor medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.
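The abstract says expert masks serve as spatial anchors from which localization targets and shape and spatial cues are derived, with geometry rules filtering weak samples. The following is a minimal sketch of that idea, not the authors' implementation: the function names (`mask_to_box`, `spatial_cues`, `passes_geometry_rules`), the cue definitions, and the `min_area` threshold are all assumptions made for illustration, and the VLM query synthesis and image-based judging stages are not shown.

```python
from typing import Dict, List, Optional, Tuple, Union

Mask = List[List[int]]            # rows of 0/1 pixels; stand-in for an expert segmentation mask
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixel coordinates


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tightest box around all foreground pixels, or None for an empty mask."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def spatial_cues(mask: Mask) -> Dict[str, Union[int, float, str]]:
    """Hypothetical cues: pixel area, box fill ratio, and a coarse location label."""
    box = mask_to_box(mask)
    if box is None:
        raise ValueError("empty mask")
    x0, y0, x1, y1 = box
    area = sum(1 for row in mask for v in row if v)
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    height, width = len(mask), len(mask[0])
    # Location is expressed in image coordinates, not anatomical left/right.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    vertical = "upper" if cy < height / 2 else "lower"
    horizontal = "left" if cx < width / 2 else "right"
    return {
        "area": area,
        "fill_ratio": area / box_area,
        "location": f"{vertical} {horizontal}",
    }


def passes_geometry_rules(mask: Mask, min_area: int = 4) -> bool:
    """Assumed rule: reject empty masks and regions smaller than min_area pixels."""
    if mask_to_box(mask) is None:
        return False
    return sum(1 for row in mask for v in row if v) >= min_area


# Toy 6x6 mask with a small region in the upper-right part of the image.
example_mask: Mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

example_box = mask_to_box(example_mask)
example_cues = spatial_cues(example_mask)
example_ok = passes_geometry_rules(example_mask)
```

In a pipeline of this shape, the box would become the grounding target and the cues would be passed to a VLM as hints when it writes the referring query. A sample that fails the geometry check would be dropped before any visual judging.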

Citations: 1 · Influential citations: 0 · Altmetric: 4 · Score: 21.0
