
NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed

2025-05-26

Summary

This paper introduces NileChat, a new language model designed to understand and communicate in Egyptian and Moroccan Arabic dialects, which have far less digital data available than most widely spoken languages.

What's the problem?

The problem is that most language models are trained on large, high-resource languages and often perform poorly on smaller languages and local dialects, leaving many communities without AI tools that understand their way of speaking and their culture.

What's the solution?

The researchers developed a methodology for gathering and generating pre-training data that reflects the linguistic and cultural characteristics of these local dialects. Using this method, they built NileChat, a 3B parameter language model that outperforms other similar-sized models on Egyptian and Moroccan Arabic.

Why it matters?

This work matters because it makes AI more inclusive: it gives communities access to technology that understands and respects their language and culture, which can improve communication, education, and access to information.

Abstract

A methodology is proposed to create pre-training data tailored to low-resource languages and cultures, demonstrated through NileChat, a 3B parameter LLM for Egyptian and Moroccan dialects, which outperforms existing similar-sized Arabic-aware models.