All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava

2024-11-26

Summary

This paper presents the All Languages Matter Benchmark (ALM-bench), a new evaluation tool designed to assess large multimodal models (LMMs) across 100 different languages, focusing on cultural diversity and inclusivity.

What's the problem?

Most existing LMMs are trained primarily on data from a handful of regions and languages, which limits their ability to understand and respect different cultural contexts. This is especially problematic for low-resource languages, which are often overlooked, making it hard for these models to provide accurate and culturally sensitive responses in diverse settings.

What's the solution?

The authors created ALM-bench, which pairs a wide range of culturally diverse images with text in 100 languages. The benchmark tests LMMs on their ability to reason about these image-text pairs using several question formats: true/false, multiple choice, and open-ended questions with short- and long-answer variants. It covers 13 cultural aspects, such as traditions and celebrations, ensuring a comprehensive evaluation of how well models handle different cultural nuances; a sketch of what such an evaluation loop might look like follows below.
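To make the setup concrete, here is a minimal sketch of how a benchmark like this could be scored per language and question type. The sample schema (fields like `language`, `question_type`, `answer`), the `StubModel` class, and the exact-match scoring are illustrative assumptions, not ALM-bench's actual data format or evaluation protocol.

```python
# Minimal sketch of scoring an LMM on ALM-bench-style data.
# All field names and StubModel below are hypothetical, not the
# benchmark's real schema or API.
from collections import defaultdict

class StubModel:
    """Placeholder standing in for a real multimodal model client."""
    def answer(self, image: str, question: str) -> str:
        return "False"  # trivially answers every question the same way

def evaluate(model, samples):
    """Return accuracy keyed by (language, question_type)."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        # Each sample pairs an image with a question in one of 100 languages.
        pred = model.answer(image=s["image_path"], question=s["question"])
        key = (s["language"], s["question_type"])
        total[key] += 1
        # Exact match suits true/false and multiple choice; open-ended
        # answers would need fuzzy or judge-based matching instead.
        correct[key] += int(pred.strip().lower() == s["answer"].strip().lower())
    return {k: correct[k] / total[k] for k in total}

if __name__ == "__main__":
    samples = [  # tiny illustrative examples, not real benchmark items
        {"image_path": "img/festival.jpg",
         "question": "Is this a wedding ceremony?",
         "answer": "False", "language": "Swahili",
         "question_type": "true_false"},
        {"image_path": "img/dish.jpg",
         "question": "Which region is this dish from? (A) Punjab (B) Kerala",
         "answer": "A", "language": "Hindi",
         "question_type": "multiple_choice"},
    ]
    for (lang, qtype), acc in evaluate(StubModel(), samples).items():
        print(f"{lang:10s} {qtype:16s} accuracy={acc:.2f}")
```

Reporting accuracy per (language, question type) pair, rather than a single aggregate score, is what lets a benchmark like this expose gaps on low-resource languages that an overall average would hide.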

Why it matters?

This research matters because it pushes AI models toward greater inclusivity and cultural awareness. By evaluating LMMs on a diverse set of languages and cultural contexts, ALM-bench encourages the development of models that serve global populations more effectively, ultimately leading to better understanding and communication across cultures.

Abstract

Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short- and long-answer categories. The ALM-bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.