SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus

2024-06-17

Summary

This paper introduces SEACrowd, a new resource hub and benchmark designed to improve AI models for Southeast Asian languages. It aims to provide better datasets and evaluation methods for the diverse languages spoken in this region.

What's the problem?

Southeast Asia is home to over 1,300 indigenous languages, but existing AI models are trained predominantly on English data. As a result, SEA languages are underrepresented in AI systems, which can lead to poor performance and cultural misrepresentation. High-quality datasets for these languages are also scarce, making it difficult to evaluate models on them accurately.

What's the solution?

To tackle these issues, the authors created SEACrowd, a collaborative initiative that consolidates resources for nearly 1,000 SEA languages. The hub provides standardized datasets across three modalities: text, images, and audio. SEACrowd also includes benchmarks that evaluate AI models on 36 indigenous languages across 13 tasks, giving researchers a clearer picture of how well current models handle these languages.

Why does it matter?

This research is significant because it addresses the resource gap for Southeast Asian languages in AI development. By providing better datasets and evaluation frameworks, SEACrowd aims to enhance the quality of AI models for these languages, promoting greater cultural representation and utility. This initiative can lead to more effective AI applications in education, communication, and other fields relevant to the diverse populations of Southeast Asia.

Abstract

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
