2305.09617 May 16, 2023 cs.AI

Towards Expert-Level Medical Question Answering with Large Language Models

Zhuoran Li

Citations: 0

h-index: 0

K. Singhal

Citations: 5,903

h-index: 10

Juraj Gottweis

Citations: 6,348

h-index: 11

R. Sayres

Citations: 6,852

h-index: 22

Ellery Wulczyn

Citations: 4,653

h-index: 23

Le Hou

Citations: 10,558

h-index: 15

Kevin Clark

Stanford University

Citations: 5,642

h-index: 16

S. Pfohl

Citations: 2,084

h-index: 20

H. Cole-Lewis

Citations: 7,777

h-index: 21

Darlene Neal

Citations: 1,463

h-index: 3

Mike Schaekermann

Citations: 5,354

h-index: 17

Amy Wang

Citations: 1,995

h-index: 6

Mohamed Amin

Citations: 764

h-index: 3

S. Lachgar

Citations: 820

h-index: 7

P. A. Mansfield

Citations: 5,633

h-index: 10

Sushant Prakash

Citations: 2,569

h-index: 10

Bradley Green

Citations: 3,025

h-index: 11

Ewa Dominowska

Citations: 2,403

h-index: 6

B. A. Y. Arcas

Citations: 31,080

h-index: 20

Nenad Tomašev

Citations: 13,456

h-index: 26

Yun Liu

Citations: 11,497

h-index: 41

Renee C Wong

Citations: 1,228

h-index: 4

Christopher Semturs

Citations: 8,825

h-index: 17

S. S. Mahdavi

Citations: 9,874

h-index: 17

J. Barral

Citations: 5,920

h-index: 15

D. Webster

Citations: 18,504

h-index: 31

G. Corrado

Citations: 12,854

h-index: 42

Yossi Matias

Google, Tel Aviv University

Citations: 16,807

h-index: 53

Shekoofeh Azizi

Citations: 11,725

h-index: 30

A. Karthikesalingam

Citations: 4,252

h-index: 21

Vivek Natarajan

Citations: 12,566

h-index: 29

Original Abstract

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
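The "over 19%" improvement quoted in the abstract is consistent with an absolute difference in percentage points between the two MedQA scores, not a relative change. A minimal sketch of the arithmetic follows, using only the two scores given in the abstract; the variable names are illustrative and the "error reduction" figure is a derived quantity that the abstract does not report.

```python
# MedQA accuracies (%) as stated in the abstract.
med_palm_medqa = 67.2    # Med-PaLM
med_palm2_medqa = 86.5   # Med-PaLM 2 (best reported, "up to")

# Absolute gain in percentage points: the "over 19%" in the abstract.
absolute_gain = med_palm2_medqa - med_palm_medqa

# Relative gain as a percentage of the Med-PaLM baseline.
relative_gain = absolute_gain / med_palm_medqa * 100

# Share of Med-PaLM's remaining errors that Med-PaLM 2 removes (derived here).
error_reduction = absolute_gain / (100 - med_palm_medqa) * 100

print(f"absolute gain:   {absolute_gain:.1f} percentage points")
print(f"relative gain:   {relative_gain:.1f}% of baseline")
print(f"error reduction: {error_reduction:.1f}% of baseline errors")
```

Reading the gain as percentage points matters when comparing against other benchmarks, where relative and absolute improvements can differ widely.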

722 Citations

64 Influential

26.5 Altmetric

982.5 Score

AI Analysis

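Summary

This paper proposes Med-PaLM 2, a model specialized for the medical domain and built on PaLM 2, Google's latest large language model. The researchers developed it by combining instruction finetuning on medical datasets with a new prompting strategy, "ensemble refinement". As a result, Med-PaLM 2 reached 86.5% accuracy on the USMLE-style MedQA dataset, setting a new state of the art (SOTA). In evaluations carried out by physicians, Med-PaLM 2's long-form answers were also rated above physician-written answers on eight of nine evaluation axes, which the authors present as evidence of clinical utility.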

Key Innovations

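Use of the improved base model PaLM 2, with finetuning specific to the medical domain
Introduction of "ensemble refinement", a prompting technique that improves complex medical reasoning
State-of-the-art accuracy of 86.5% on USMLE-style questions (MedQA)
A multi-dimensional human evaluation rubric applied by both physician and lay raters
New adversarial question datasets built to test model safety and bias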

Learning & Inference Impact

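During training, the PaLM 2 model was instruction-finetuned on MultiMedQA (a collection of medical QA datasets) to align it with medical knowledge and the expected answer style. At inference time, the key change is the introduction of "ensemble refinement". The technique first has the model generate several possible reasoning paths (chain-of-thought, CoT) for a question (stage 1), then feeds those generated paths back in as conditioning input to produce a final, refined answer (stage 2). This goes beyond simple majority voting (self-consistency) by letting the model aggregate and improve on its own reasoning. Accuracy rose substantially, at the cost of more compute at inference time.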
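The two-stage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Generate` callable, the prompt wording, the sample counts, the temperature, and the final vote over repeated stage-2 samples are all hypothetical choices made for this sketch. `make_toy_model` is a deterministic stand-in for an LLM so that the control flow can be run without any model access.

```python
import itertools
from collections import Counter
from typing import Callable, List

# Hypothetical model interface: (prompt, temperature) -> generated text that
# ends with "Answer: <choice>". A real LLM client would be wrapped to fit this.
Generate = Callable[[str, float], str]


def extract_answer(text: str) -> str:
    """Return whatever follows the last 'Answer:' marker in a generation."""
    return text.rsplit("Answer:", 1)[-1].strip()


def self_consistency(generate: Generate, question: str, n: int = 5) -> str:
    """Baseline: sample n reasoning paths and take a majority vote."""
    votes = [extract_answer(generate(question, 0.7)) for _ in range(n)]
    return Counter(votes).most_common(1)[0][0]


def ensemble_refinement(generate: Generate, question: str,
                        n_paths: int = 5, n_refine: int = 3) -> str:
    """Two-stage sketch: sample reasoning paths, then refine conditioned on them."""
    # Stage 1: sample several chain-of-thought reasoning paths.
    paths: List[str] = [generate(question, 0.7) for _ in range(n_paths)]

    # Stage 2: condition the model on the question plus its own stage-1 paths.
    refine_prompt = (
        f"Question: {question}\n"
        "Candidate reasoning paths:\n"
        + "\n".join(f"{i + 1}. {p}" for i, p in enumerate(paths))
        + "\nWrite one refined explanation, then give the final choice."
    )
    # Assumption of this sketch: repeat stage 2 and vote over the refined answers.
    refined = [extract_answer(generate(refine_prompt, 0.7)) for _ in range(n_refine)]
    return Counter(refined).most_common(1)[0][0]


def make_toy_model() -> Generate:
    """Build a deterministic stub model, used only to exercise the control flow."""
    stage1_answers = itertools.cycle(["B", "A", "B", "C", "B"])

    def toy_model(prompt: str, temperature: float) -> str:
        if "Candidate reasoning paths" in prompt:
            # Stage 2: the stub "refines" by picking the most common candidate.
            candidates = [extract_answer(line)
                          for line in prompt.splitlines() if "Answer:" in line]
            best = Counter(candidates).most_common(1)[0][0]
            return f"After comparing the candidates. Answer: {best}"
        # Stage 1: return a canned reasoning path with the next scripted answer.
        return f"Reasoning path. Answer: {next(stage1_answers)}"

    return toy_model


if __name__ == "__main__":
    question = "Which option is correct? (A) ... (B) ... (C) ..."
    print("self-consistency:   ", self_consistency(make_toy_model(), question))
    print("ensemble refinement:", ensemble_refinement(make_toy_model(), question))
```

The structural difference from self-consistency is that the stage-1 outputs are passed back to the model as context instead of only being counted. The extra cost is visible in the call count: `n_paths + n_refine` generations per question, with the stage-2 prompts growing with the number of paths.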

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
