The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
Sheriff Issaka, Keyi Wang, Yinka Ajibola, Oluwatumininu Samuel-Ipaye, Zhaoyi Zhang, Nicte Aguillon Jimenez, Evans Kofi Agyei, Abraham Lin, Rohan Ramachandran, Sadick Abdul Mumin, Faith Nchifor, Mohammed Shuraim, Lieqi Liu, Erick Rosas Gonzalez, Sylvester Kpei, Jemimah Osei, Carlene Ajeneza, Persis Boateng, Prisca Adwoa Dufie Yeboah, Saadia Gabriel
2025-10-09
Summary
This paper introduces the African Languages Lab (All Lab), a project focused on improving computer understanding of African languages, which are currently very poorly supported by existing technology.
What's the problem?
Most modern language technologies, such as translation tools and voice assistants, are built using data from widely spoken languages like English and Mandarin. African languages, despite being spoken by hundreds of millions of people, have been largely neglected, so these technologies work poorly, or not at all, for them. Over 88% of African languages are classified as severely underrepresented or completely ignored in computational linguistics.
What's the solution?
The researchers created a large, quality-controlled dataset of both text and speech covering 40 African languages, comprising 19 billion tokens of monolingual text and over 12,600 hours of aligned speech. They then used this data to fine-tune existing models, which substantially boosted performance on machine translation: an average gain of more than 23 ChrF++ points (alongside gains in COMET and BLEU) across the evaluated languages. They also mentored fifteen early-career researchers from Africa to continue this work.
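To give a sense of what a ChrF++-style score measures, here is a minimal, simplified sketch of a character n-gram F-score in plain Python. It illustrates the core idea of the metric only; it is not the official implementation used in the paper (tools such as sacrebleu are the standard choice), and the example strings are invented:

```python
from collections import Counter

def ngrams(seq, n):
    """Count all length-n substrings (character n-grams) of seq."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average character n-gram precision
    and recall for n = 1..max_n, combined as an F-beta score (beta=2
    weights recall more heavily). Returns a score in [0, 100]."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        if h and r:
            overlap = sum((h & r).values())  # clipped n-gram matches
            precisions.append(overlap / sum(h.values()))
            recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

# Invented toy example: a perfect match scores 100, a disjoint pair scores 0.
print(chrf("the cat sat", "the cat sat"))
print(chrf("abc", "xyz"))
```

The real ChrF++ additionally mixes in word 1- and 2-gram statistics, which is what distinguishes it from plain chrF; this sketch omits that step for brevity.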
Why does it matter?
This work is important because it helps bridge the digital language gap for a huge portion of the world's population. By providing resources and training, the All Lab is making it possible to develop technologies that better serve African communities and preserve their linguistic diversity. The project also shows that with focused effort, it is possible to quickly improve language technology for previously neglected languages and, for several of them, achieve results competitive with major commercial systems like Google Translate.
Abstract
Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.