InkubaLM: A small language model for low-resource African languages

Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Aremu Anuoluwapo, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman

2024-09-02

Summary

This paper introduces InkubaLM, a small language model designed specifically to understand and generate text in low-resource African languages.

What's the problem?

Many existing language models perform well in high-resource languages (like English or Spanish) but struggle with African languages due to a lack of training data and computing resources. This makes it hard for speakers of these languages to access technology that can understand them.

What's the solution?

InkubaLM is a small language model with only 0.4 billion parameters, so it needs far less computing power and data to work effectively. Despite its size, it performs comparably to much larger models on tasks like machine translation and question answering, and it shows strong performance in sentiment analysis across different African languages.

Why does it matter?

This research is important because it helps bridge the gap in technology for speakers of low-resource languages. By providing a model that works well with limited data, InkubaLM can improve access to information and technology for many people in Africa, supporting education, communication, and cultural preservation.

Abstract

High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question answering, AfriMMLU, and AfriXnli. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available at https://huggingface.co/lelapa to encourage research and development on low-resource languages.