2601.16685v1 Jan 23, 2026 cs.AI

AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning

Jingqi Dong (citations: 0, h-index: 0)
Xuan Ding (citations: 33, h-index: 4)
Rui Sun (citations: 6, h-index: 1)
Yiming Yang (citations: 23, h-index: 3)
Shuguang Cui (citations: 697, h-index: 14)
Zhen Li (citations: 21, h-index: 3)
Suzhong Fu (citations: 18, h-index: 3)

Original Abstract

Evaluating the clinical correctness and reasoning fidelity of automatically generated medical imaging reports remains a critical yet unresolved challenge. Existing evaluation methods often fail to capture the structured diagnostic logic that underlies radiological interpretation, resulting in unreliable judgments and limited clinical relevance. We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. We also construct a multi-domain perturbation-based benchmark covering five medical report datasets with diverse imaging modalities and controlled semantic variations. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations. This framework represents a step toward transparent and clinically grounded assessment of medical report generation systems, fostering trustworthy integration of large language models into clinical practice.
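The staged evaluation described in the abstract (criteria definition, evidence extraction, alignment, consistency scoring) can be illustrated with a small sketch. This is not the authors' implementation: AgentsEval uses reasoning agents for each step, whereas the code below substitutes simple keyword and negation rules so that it runs on its own. All names here (`EvaluationTrace`, `define_criteria`, `extract_evidence`, `evaluate`, the finding vocabulary, and the sample reports) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumption: a tiny negation lexicon stands in for an agent's clinical reasoning.
NEGATIONS = ("no ", "without ", "negative for ")


@dataclass
class EvaluationTrace:
    """Keeps every intermediate result so the final score can be inspected."""
    criteria: list[str]
    reference_evidence: dict[str, str]
    candidate_evidence: dict[str, str]
    agreement: dict[str, bool]
    score: float


def define_criteria(reference: str, vocabulary: list[str]) -> list[str]:
    """Step 1: choose which findings to check (terms the reference mentions)."""
    lowered = reference.lower()
    return [term for term in vocabulary if term in lowered]


def extract_evidence(report: str, criteria: list[str]) -> dict[str, str]:
    """Step 2: label each criterion as present, absent, or not mentioned."""
    sentences = [s.strip().lower() for s in report.split(".") if s.strip()]
    evidence = {}
    for term in criteria:
        status = "not mentioned"
        for sentence in sentences:
            if term in sentence:
                negated = any(
                    sentence.startswith(n) or f" {n}" in sentence
                    for n in NEGATIONS
                )
                status = "absent" if negated else "present"
                break
        evidence[term] = status
    return evidence


def evaluate(reference: str, candidate: str, vocabulary: list[str]) -> EvaluationTrace:
    """Steps 3 and 4: align the two evidence sets and score their consistency."""
    criteria = define_criteria(reference, vocabulary)
    ref_ev = extract_evidence(reference, criteria)
    cand_ev = extract_evidence(candidate, criteria)
    agreement = {term: ref_ev[term] == cand_ev[term] for term in criteria}
    score = sum(agreement.values()) / len(agreement) if agreement else 0.0
    return EvaluationTrace(criteria, ref_ev, cand_ev, agreement, score)


# Hypothetical reports: a paraphrase and a semantic perturbation of the reference.
vocabulary = ["consolidation", "pleural effusion", "pneumothorax", "cardiomegaly"]
reference = "Right lower lobe consolidation. No pleural effusion. No pneumothorax."
paraphrase = "There is consolidation in the right lower lobe. Without pleural effusion or pneumothorax."
flipped = "Right lower lobe consolidation. Small pleural effusion. No pneumothorax."

paraphrase_trace = evaluate(reference, paraphrase, vocabulary)
flipped_trace = evaluate(reference, flipped, vocabulary)
```

In this toy setting the paraphrased report agrees with the reference on every criterion, while the report with the flipped effusion finding loses credit on exactly that criterion, and the trace records which finding caused the drop. That mirrors, in a much simplified form, the robustness and interpretability properties the abstract claims for paraphrastic versus semantic perturbations.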

Citations: 0 · Influential: 0 · Altmetric: 7 · Score: 35.0
