2604.14656v1 Apr 16, 2026 cs.AI

Rethinking Patient Education as Multi-turn Multi-modal Interaction

Ben Wang

Zhipeng Tang

Zonghai Yao

UMASS Amherst

Hong Yu

Chen-Tan Lin

Xiong Luo

Juncheng Huang

C. S. Ong

Original Abstract

Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.
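The interaction the abstract describes can be pictured as a simple turn loop. The sketch below is illustrative only and is not taken from the MedImageEdu code: every class, function, and field name is hypothetical, the agents are toy stand-ins rather than vision-language models, and it assumes the hidden profile is visible to the patient side only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The five evaluation dimensions named in the abstract.
DIMENSIONS = (
    "Consultation",
    "Safety and Scope",
    "Language Quality",
    "Drawing Quality",
    "Image-Text Response Quality",
)

@dataclass
class Case:
    report_text: str
    image_ids: List[str]

@dataclass
class PatientProfile:
    # Assumption: hidden from the doctor agent, seen only by the patient agent.
    education_level: str
    health_literacy: str
    personality: str

@dataclass
class DoctorReply:
    text: str
    drawing_instruction: Optional[str] = None  # set when visual support helps
    images: List[str] = field(default_factory=list)

def run_consultation(case, profile, patient, doctor, draw, max_turns=3):
    """Alternate patient questions and doctor replies; return the transcript."""
    transcript = []
    for _ in range(max_turns):
        question = patient(case, profile, transcript)
        if question is None:  # the patient has no further questions
            break
        reply = doctor(case, question, transcript)
        if reply.drawing_instruction is not None:
            # The drawing tool returns image(s) that accompany the explanation.
            reply.images = draw(case.image_ids, reply.drawing_instruction)
        transcript.append((question, reply))
    return transcript

# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_patient(case, profile, transcript):
    questions = ["Where is the finding on my scan?", "Is it serious?"]
    return questions[len(transcript)] if len(transcript) < len(questions) else None

def toy_doctor(case, question, transcript):
    if "Where" in question:
        return DoctorReply("The marked area shows the finding.", "circle the finding")
    return DoctorReply("Your care team can explain what this means for you.")

def toy_draw(image_ids, instruction):
    return [f"{image_ids[0]}:annotated({instruction})"]

case = Case("Example report text.", ["img1"])
profile = PatientProfile("high school", "low", "anxious")
transcript = run_consultation(case, profile, toy_patient, toy_doctor, toy_draw)
```

In this sketch only the first turn triggers the drawing tool, so the transcript holds one reply with an annotated image and one text-only reply. A benchmark evaluator would then score the transcript along each entry in `DIMENSIONS`; that scoring step is not sketched here because the abstract does not specify it.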
