2604.15808v1 Apr 17, 2026 cs.CV

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

Zelin Zhao

Citations: 9

h-index: 2

Lama Moukheiber

Citations: 1,109

h-index: 11

Caleb M. Yeung

Citations: 31

h-index: 4

Haotian Xue

Georgia Tech

Citations: 276

h-index: 8

Alec Helbling

Citations: 454

h-index: 5

Yongxin Chen

Citations: 3

h-index: 1

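Spatial reasoning and visual grounding are core capabilities of vision-language models (VLMs), yet most medical VLMs make predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on 2D images, which overlooks the three-dimensional nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. This work introduces Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a benchmark of 41,307 question-answer pairs. It is built from brain and knee studies in the fastMRI+ dataset, using annotations by expert radiologists. Each question-answer pair includes a clinician-aligned reasoning trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, so that models must jointly reason about what is present, where it is, and across which frames it extends. A comparison of 10 VLMs shows that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves spatial grounding performance over strong zero-shot baselines. This suggests that targeted spatial supervision is an effective approach to grounded reasoning in the medical domain.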

Original Abstract

Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.
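
To make the idea of a QA pair with frame-indexed bounding boxes concrete, the following is a minimal Python sketch. The abstract does not specify the record layout or the grounding metric, so everything here is an assumption for illustration: the class and field names (`FrameBox`, `GroundedQA`, `reasoning`, `boxes`) are hypothetical, the `(x, y, width, height)` box convention is assumed, and per-frame intersection-over-union (IoU) is used only as a common stand-in for a grounding score, not as the paper's evaluation protocol.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrameBox:
    """A bounding box tied to one slice (frame) of an MRI volume.

    Assumed convention: (x, y) is the top-left corner, in pixels.
    """
    frame: int
    x: float
    y: float
    width: float
    height: float


@dataclass
class GroundedQA:
    """Hypothetical record for one spatially grounded QA pair."""
    question: str
    answer: str
    reasoning: str  # chain-of-thought trace, stored as plain text
    boxes: list[FrameBox] = field(default_factory=list)


def box_iou(a: FrameBox, b: FrameBox) -> float:
    """Return IoU of two boxes, or 0.0 if they lie on different frames."""
    if a.frame != b.frame:
        return 0.0
    # Overlap extent along each axis, clamped at zero when disjoint.
    ix = max(0.0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    inter = ix * iy
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def frames_spanned(qa: GroundedQA) -> list[int]:
    """Return the sorted, de-duplicated frame indices a finding covers."""
    return sorted({box.frame for box in qa.boxes})
```

Treating boxes on different frames as non-overlapping reflects the multi-frame framing in the abstract: a prediction placed on the wrong slice earns no credit even if its in-plane coordinates match. Whether the benchmark scores grounding this way is not stated in this listing.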
