
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin

2026-01-08

Summary

This paper introduces EpiQAL, a new test designed to see how well computer programs can understand and reason about public health information, specifically how diseases spread and how to fight them.

What's the problem?

Currently, most tests for computer programs that answer medical questions focus on individual patients or clinical facts. There wasn't a good way to test if these programs could actually understand and *use* research studies to figure out things like how many people have a disease, how it's spreading, or if a treatment is working on a population level. Essentially, existing tests didn't check if AI could do the kind of thinking epidemiologists do.

What's the solution?

The researchers created EpiQAL, which contains three sets of questions spanning many different diseases, all built from openly available research papers. The first set checks whether a program can recall facts stated in a paper. The second requires combining evidence from the paper with epidemiological principles across multiple reasoning steps. The third asks the program to reconstruct a study's conclusions even when part of the research paper (the 'Discussion' section) is hidden. The researchers then tested ten different AI models on these questions to see how they performed.
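The paper's actual evaluation code isn't shown here, but the basic shape of scoring models on a multiple-choice benchmark like this one can be sketched as follows. The question format and the `ask_model` stub are illustrative assumptions, not EpiQAL's real data or the authors' harness; in practice `ask_model` would call a language model.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# The sample questions and ask_model stub are illustrative assumptions,
# not taken from EpiQAL itself.

QUESTIONS = [
    {"question": "Which measure counts new cases per population over a period?",
     "choices": {"A": "Prevalence", "B": "Incidence", "C": "Specificity"},
     "answer": "B"},
    {"question": "A basic reproduction number below 1 implies an outbreak will...",
     "choices": {"A": "Grow", "B": "Decline", "C": "Oscillate"},
     "answer": "B"},
]

def ask_model(question: str, choices: dict) -> str:
    """Stand-in for a real LLM call; here it simply always answers 'B'."""
    return "B"

def evaluate(questions: list) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(ask_model(q["question"], q["choices"]) == q["answer"]
                  for q in questions)
    return correct / len(questions)

if __name__ == "__main__":
    print(f"accuracy = {evaluate(QUESTIONS):.2f}")
```

Real harnesses add per-subset breakdowns (recall vs. multi-step inference vs. conclusion reconstruction), which is what lets a benchmark like EpiQAL give diagnostic rather than just aggregate scores.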

Why it matters?

This work is important because it shows that current AI models aren't very good at epidemiological reasoning, even the really big ones. EpiQAL provides a way to pinpoint *where* these models struggle – whether it's finding the right information, connecting the dots between different pieces of evidence, or drawing logical conclusions. This helps researchers improve AI so it can be a useful tool for public health officials.

Abstract

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.