2601.03471v1 Jan 06, 2026 cs.CL

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Qi He (Citations: 298, h-index: 3)
Carl Yang (Citations: 28, h-index: 2)
Mingyang Wei (Citations: 10, h-index: 1)
Dehai Min (Citations: 9, h-index: 2)
Zewen Liu (Citations: 204, h-index: 7)
Yuzhang Xie (Citations: 85, h-index: 5)
Guanchen Wu (Citations: 7, h-index: 2)
Max S. Y. Lau (Citations: 151, h-index: 5)
Lu Cheng (Citations: 3, h-index: 1)
Wei Jin (Citations: 23, h-index: 3)

Abstract

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.
