
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Nonnenmacher, Oliver Higgins, Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, Michael Borowitz

2025-05-12


Summary

This paper introduces PubHealthBench, a new benchmark designed to measure how well large language models know and understand UK Government public health information.

What's the problem?

The problem is that even though AI language models are increasingly used to answer health questions, we don't really know how accurate or reliable they are when it comes to specific UK public health advice. This matters because giving out incorrect information could affect people's health.

What's the solution?

The researchers built a benchmark of over 8,000 questions based on real UK government public health guidance documents. They tested 24 different language models using both multiple-choice and free-form (open-ended) questions. The best models did extremely well on multiple-choice questions, even outperforming people using search engines, but were noticeably weaker when answering open-ended questions without answer choices to pick from.
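To make the multiple-choice evaluation concrete, here is a minimal sketch of how MCQA accuracy scoring typically works: the model picks one answer letter per question, and the score is the fraction of picks that match the answer key. The function name and toy data below are illustrative assumptions, not part of the actual PubHealthBench release.

```python
# Illustrative MCQA accuracy scoring (hypothetical, not the paper's harness).

def score_mcqa(predictions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return the fraction of questions where the model's chosen
    answer letter matches the reference answer key."""
    correct = sum(
        1 for qid, choice in predictions.items()
        if answer_key.get(qid) == choice
    )
    return correct / len(answer_key)

# Toy example: three questions, the model gets two right.
answer_key = {"q1": "B", "q2": "D", "q3": "A"}
predictions = {"q1": "B", "q2": "C", "q3": "A"}
print(score_mcqa(predictions, answer_key))  # 2 of 3 correct
```

Free-form responses are harder to score automatically, since there is no fixed set of options to compare against, which is one reason models tend to look stronger on MCQA than on open-ended answers.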

Why it matters?

This matters because it shows that AI models can be a strong source of public health information, but risks remain if they are used without checks, especially for open-ended or complicated questions. The benchmark gives researchers and developers a way to make these models safer and more reliable for real-world health advice.

Abstract

A new benchmark, PubHealthBench, evaluates LLMs' knowledge of UK Government public health information, finding high accuracy on MCQA but lower performance on free-form responses.