INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez
2024-12-03

Summary
This paper introduces INCLUDE, a new evaluation benchmark designed to assess how well multilingual language models understand language and regional knowledge across a wide range of cultural contexts.
What's the problem?
Large language models (LLMs) often perform much better in some languages than in others, which limits their usefulness in many regions. A major issue is that most evaluation resources focus on English and do not capture the cultural and regional knowledge needed to understand other languages. This lack of high-quality evaluation tools makes it difficult to develop effective multilingual models that can serve diverse communities.
What's the solution?
INCLUDE addresses this problem by creating a comprehensive benchmark that includes 197,243 question-and-answer pairs from local exams in 44 different languages. These questions come from various sources, such as academic tests and professional certification exams, ensuring that they reflect the regional knowledge and cultural context needed for effective language understanding. This evaluation suite allows researchers to measure the performance of multilingual LLMs in realistic scenarios where they would be used.
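To make the evaluation setup concrete, below is a minimal sketch of how a researcher might score a model on exam-style multiple-choice items like those collected in INCLUDE. The item schema, field names, and the stub model here are illustrative assumptions for the sketch, not the benchmark's released format or the authors' evaluation code.

```python
# Minimal sketch: scoring a model on INCLUDE-style multiple-choice exam items.
# The item layout (question text, four lettered options, gold letter) is an
# assumed, simplified schema for illustration only.

def format_prompt(item: dict) -> str:
    """Render one multiple-choice exam item as a zero-shot prompt."""
    options = "\n".join(f"{letter}. {text}" for letter, text in item["options"].items())
    return (
        f"Question: {item['question']}\n"
        f"{options}\n"
        "Answer with the letter of the correct option."
    )

def accuracy(answer_fn, items) -> float:
    """Fraction of items where the model's first output letter matches the gold answer."""
    correct = 0
    for item in items:
        prediction = answer_fn(format_prompt(item)).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    # Toy item standing in for a locally sourced, region-specific exam question.
    items = [
        {
            "question": "Which river flows through Budapest?",
            "options": {"A": "Danube", "B": "Vistula", "C": "Elbe", "D": "Rhône"},
            "answer": "A",
        }
    ]
    stub_model = lambda prompt: "A"  # placeholder; swap in a real LLM call here
    print(f"Accuracy: {accuracy(stub_model, items):.2f}")
```

In practice, the stub model would be replaced by a call to the multilingual LLM under test, and the toy item by questions drawn from the benchmark's local exam sources.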
Why it matters?
This research is significant because it helps improve the development of multilingual AI tools, making them more useful and relevant for people in different regions. By providing a way to evaluate how well these models understand various languages and cultures, INCLUDE can enhance the effectiveness of AI technologies in education, business, and everyday life, ultimately bridging the gap between different language communities.
Abstract
The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.