AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset

Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, Bonaventure F. P. Dossou, Moshood Yekini, Jonas Kemp, Katherine Heller, Jude Chidubem Omeke, Chidi Asuzu MD, Naome A. Etori, Aimérou Ndiaye, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best

2024-11-29

Summary

This paper presents AfriMed-QA, a new benchmark dataset of 15,000 medical questions and answers spanning 32 specialties, built to evaluate and improve medical question-answering systems across Africa.

What's the problem?

Many low- and middle-income countries face acute shortages of doctors and specialists, limiting access to quality healthcare. Large language models (LLMs) could help by answering medical questions, but their effectiveness across Africa has not been tested thoroughly. Existing medical benchmarks rarely reflect the healthcare contexts of these regions, making it hard to judge whether LLMs provide accurate, locally relevant information.

What's the solution?

The authors created AfriMed-QA, a dataset of 15,000 questions sourced from over 60 medical schools across 16 African countries, covering 32 medical specialties and including both multiple-choice and open-ended questions. They evaluated 30 different LLMs on these questions along several axes, including correctness and demographic bias. Performance varied significantly by specialty and geography: general-purpose models outperformed specialized biomedical LLMs, smaller edge-friendly models struggled to reach a passing score, and human raters consistently preferred LLM answers and explanations over clinician answers.
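To make the evaluation setup concrete, here is a minimal Python sketch of an MCQ-accuracy harness in the spirit of the paper's protocol. The record fields ("question", "options", "answer", "specialty"), the prompt wording, and the model callable are illustrative assumptions, not the authors' released code.

from collections import defaultdict

LETTERS = "ABCDE"

def format_prompt(item: dict) -> str:
    """Render one multiple-choice question as a plain-text prompt."""
    options = "\n".join(
        f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item["options"])
    )
    return (
        f"{item['question']}\n{options}\n"
        "Answer with the letter of the single best option."
    )

def evaluate_mcq(dataset, model):
    """Score accuracy overall and per specialty for one model."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in dataset:
        # Keep only the first letter of the model's reply as its choice.
        prediction = model(format_prompt(item)).strip().upper()[:1]
        for key in ("overall", item.get("specialty", "unknown")):
            total[key] += 1
            correct[key] += prediction == item["answer"]
    return {key: correct[key] / total[key] for key in total}

# Smoke test with a stub model that always answers "A".
sample = [{"question": "Most common cause of neonatal sepsis?",
           "options": ["Group B strep", "E. coli"],
           "answer": "A", "specialty": "pediatrics"}]
print(evaluate_mcq(sample, lambda prompt: "A"))
# -> {'overall': 1.0, 'pediatrics': 1.0}

Swapping in a real model client for the lambda, and the full dataset for the sample, would yield per-specialty accuracies like those the paper reports.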

Why it matters?

This research is important because it provides a valuable resource for improving healthcare in Africa by enabling better training and evaluation of AI systems that can assist in medical decision-making. By addressing the specific needs of African healthcare, AfriMed-QA can help ensure that LLMs are more effective and reliable in providing medical information, ultimately improving patient care and access to healthcare services.

Abstract

Recent advancements in large language model (LLM) performance on medical multiple-choice question (MCQ) benchmarks have stimulated interest from healthcare providers and patients globally. Particularly in low- and middle-income countries (LMICs) facing acute physician shortages and a lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, their effectiveness in the Global South, especially across the African continent, remains to be established. In this work, we introduce AfriMed-QA, the first large-scale Pan-African English multi-specialty medical Question-Answering (QA) dataset: 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias. Our findings show significant performance variation across specialties and geographies, and MCQ performance clearly lags USMLE (MedQA). We find that biomedical LLMs underperform general models and that smaller, edge-friendly LLMs struggle to achieve a passing score. Interestingly, human evaluations show a consistent consumer preference for LLM answers and explanations when compared with clinician answers.
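For readers who want to explore the data, a hedged loading sketch follows. It assumes the dataset is published on the Hugging Face Hub; the repository id and column names shown here are assumptions and should be checked against the authors' release.

from datasets import load_dataset  # pip install datasets

# Hedged sketch: pull AfriMed-QA from the Hugging Face Hub and count
# questions per specialty. The repo id and the "specialty" column are
# assumptions; consult the authors' release for the exact identifiers.
ds = load_dataset("intronhealth/afrimedqa_v2", split="train")

counts = {}
for row in ds:
    spec = row.get("specialty") or "unknown"
    counts[spec] = counts.get(spec, 0) + 1

for spec, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{spec}: {n}")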