2602.14158v1 Feb 15, 2026 cs.CL

A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

N. Nourmohammadi

Citations: 5

h-index: 1

Md Meem Hossain

Citations: 19

h-index: 2

Safina Showkat Ara

Citations: 135

h-index: 5

H. Anh

Citations: 1,670

h-index: 26

Zia Ush-Shamszaman

Citations: 227

h-index: 8

Original Abstract

Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
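
The abstract mentions perplexity-based uncertainty scoring and a human validation path for high-risk or high-uncertainty cases, but does not say how either is computed. The following is a minimal sketch, assuming the standard definition of perplexity as the exponential of the negative mean token log-probability. The function names, the `high_risk` flag, and the threshold of 10.0 are hypothetical choices for illustration, not values taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a generated answer from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


def needs_human_review(
    token_logprobs: Sequence[float],
    high_risk: bool,
    ppl_threshold: float = 10.0,  # hypothetical cut-off, not from the paper
) -> bool:
    """Route to human validation if the query is high-risk or the answer is uncertain."""
    return high_risk or perplexity(token_logprobs) > ppl_threshold


# Toy inputs: each token assigned probability 0.5 (confident) or 0.05 (uncertain).
confident = [math.log(0.5)] * 4
uncertain = [math.log(0.05)] * 4

print(needs_human_review(confident, high_risk=False))
print(needs_human_review(uncertain, high_risk=False))
```

In this sketch a lower perplexity means the model assigned higher probability to its own output, which is the sense in which the abstract reports that evidence augmentation reduces uncertainty.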

0 Citations

0 Influential

13 Altmetric

65.0 Score
