arXiv:2601.15161v1 [cs.CL] 21 Jan 2026

Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Hossein A. Rahmani (Citations: 177, h-index: 7)

Yinzhu Chen (Citations: 2, h-index: 1)

Abdine Maiga (Citations: 11, h-index: 2)

Emine Yilmaz (Citations: 2, h-index: 1)

Abstract

Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions can pose direct risks to patient safety. These risks are particularly hard to manage because they often manifest as subtle clinical errors that evade detection by generic metrics, while expert-authored fine-grained rubrics remain costly to construct and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework that automates the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench, our framework achieves a Clinical Intent Alignment (CIA) score of 60.12%, a statistically significant improvement over the GPT-4o baseline (55.16%). In discriminative tests, our rubrics yield a mean score delta of $\mu_\Delta = 8.658$ and an AUROC of 0.977, nearly doubling the quality separation achieved by the GPT-4o baseline (4.972). Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2 percentage points (from 59.0% to 68.2%). This provides a scalable and transparent foundation for both evaluating and improving medical LLMs. The code is available at https://anonymous.4open.science/r/Automated-Rubric-Generation-AF3C/.
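
To make the discriminative metrics in the abstract concrete, the following is a minimal Python sketch of how a mean score delta and an AUROC could be computed from rubric scores. It assumes each test case pairs a higher-quality response with a degraded one; that pairing, the function names, and the example scores are illustrative assumptions, not details taken from the paper. The last lines only redo arithmetic on the figures reported above.

```python
def mean_score_delta(good, bad):
    """Average per-pair gap between the better and the worse response score."""
    return sum(g - b for g, b in zip(good, bad)) / len(good)


def auroc(pos, neg):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical rubric scores for four paired responses (not from the paper).
good = [9.0, 7.5, 8.0, 6.0]
bad = [2.0, 3.5, 6.5, 1.0]

toy_delta = mean_score_delta(good, bad)
toy_auroc = auroc(good, bad)

# Arithmetic on the figures reported in the abstract.
cia_gain_points = 60.12 - 55.16          # CIA gain over the GPT-4o baseline
separation_ratio = 8.658 / 4.972         # "nearly doubling" the separation
refine_gain_points = 68.2 - 59.0         # absolute gain from refinement
refine_gain_relative = refine_gain_points / 59.0  # same gain, relative to 59.0%

print(f"toy delta={toy_delta:.3f}, toy AUROC={toy_auroc:.4f}")
print(f"CIA gain={cia_gain_points:.2f} points, separation ratio={separation_ratio:.2f}x")
print(f"refinement gain={refine_gain_points:.1f} points ({refine_gain_relative:.1%} relative)")
```

On the reported figures, the separation ratio comes out at roughly 1.74, and the 9.2-point refinement gain corresponds to a relative improvement of about 15.6% over the 59.0% starting point.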
