2601.07107v1 Jan 12, 2026 cs.CV

MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

Charles Fleming (Citations: 14, h-index: 2)
Meng Lu (Citations: 91, h-index: 2)
Yuxing Lu (Citations: 3, h-index: 1)
Yuchen Zhuang (Citations: 2,057, h-index: 21)
Megan C. Mullins (Citations: 91, h-index: 2)
Yang Xie (Citations: 23, h-index: 3)
Guanghua Xiao (Citations: 36, h-index: 3)
Wenqi Shi, Georgia Institute of Technology (Citations: 349, h-index: 8)
Xuan Wang (Citations: 151, h-index: 6)

Abstract

Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training--not tool access alone--unlocks effective tool-integrated reasoning for medical image analysis.
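The interaction pattern the abstract describes (decide whether to call a tool, gather sub-image evidence, then answer, with whole trajectories sampled for reinforcement learning) can be pictured with a toy loop. This is only a sketch under stated assumptions: `ToyToolEnv`, `crop`, `rollout`, and `scripted_policy` are hypothetical names invented here, the "image" is a small 2D list, and the exact-match reward is a placeholder. None of it is taken from MedVistaGym's actual interface or reward design.

```python
from dataclasses import dataclass, field

# All names below are hypothetical and for illustration only; they are
# not MedVistaGym's real API.

def crop(image, top, left, height, width):
    """Return a sub-image (list of rows) cut from a 2D list of pixel values."""
    return [row[left:left + width] for row in image[top:top + height]]

@dataclass
class ToyToolEnv:
    image: list           # 2D list standing in for a medical image
    answer: str           # ground-truth answer for the question
    max_turns: int = 4    # cap on tool calls plus the final answer
    turns: int = 0
    evidence: list = field(default_factory=list)  # sub-images gathered so far

    def step(self, action):
        """Apply one action; return (observation, reward, done)."""
        self.turns += 1
        if action["type"] == "tool" and action["name"] == "crop":
            patch = crop(self.image, **action["args"])
            self.evidence.append(patch)
            return patch, 0.0, self.turns >= self.max_turns
        if action["type"] == "answer":
            # Placeholder outcome reward: exact match with the ground truth.
            reward = 1.0 if action["text"] == self.answer else 0.0
            return None, reward, True
        # Unknown action: no observation, no reward, and the turn is spent.
        return None, 0.0, self.turns >= self.max_turns

def rollout(env, policy):
    """Sample one trajectory by querying the policy until the episode ends."""
    trajectory, observation, done, total = [], None, False, 0.0
    while not done:
        action = policy(observation, env.evidence)
        observation, reward, done = env.step(action)
        trajectory.append((action, reward))
        total += reward
    return trajectory, total

def scripted_policy(observation, evidence):
    """Stand-in for a VLM: crop once, then answer from the cropped patch."""
    if not evidence:
        return {"type": "tool", "name": "crop",
                "args": {"top": 1, "left": 1, "height": 2, "width": 2}}
    brightest = max(max(row) for row in evidence[-1])
    return {"type": "answer", "text": "lesion" if brightest > 5 else "normal"}

if __name__ == "__main__":
    image = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
    env = ToyToolEnv(image=image, answer="lesion")
    trajectory, total_reward = rollout(env, scripted_policy)
    print(len(trajectory), total_reward)
```

In a training setup, the scripted policy would be replaced by the model being trained, and the sampled trajectories with their rewards would feed a policy-gradient style update; that part is omitted here because the abstract does not specify the algorithm.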

2 Citations

0 Influential

10.5 Altmetric

54.5 Score
