CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo
2024-06-14

Summary
This paper introduces CVQA, a new benchmark for Visual Question Answering (VQA) that focuses on cultural diversity and multilingualism. It is designed to evaluate how well AI models understand and answer visual questions across a wider range of cultural perspectives and languages.
What's the problem?
Most existing VQA datasets are based primarily on English and Western cultures, which limits the ability of AI models to understand diverse cultural contexts. Even when efforts are made to cover more languages, these datasets often reuse the same images, so the range of cultures represented stays narrow. This lack of diversity can bias how AI models interpret and answer questions about images.
What's the solution?
To address these issues, the authors created CVQA, which includes culturally relevant images and questions collected from 28 countries on four continents. The dataset covers 26 languages written in 11 scripts and contains 9,000 questions. Native speakers and cultural experts were involved in the data collection process to ensure that the images and questions reflect diverse cultural perspectives. The result is a more inclusive benchmark for evaluating how well AI models understand and reason about visual information from a wide range of cultures.
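As a rough illustration of how a benchmark like this is typically used, the sketch below scores a multimodal model on multiple-choice VQA items. The JSON layout, field names, file name, and query_model() function are assumptions made for illustration, not the authors' released data format or evaluation code.

```python
# Minimal sketch of scoring a multimodal model on CVQA-style multiple-choice
# questions. The JSON layout, field names, and query_model() are hypothetical;
# they are NOT the authors' released format or evaluation harness.
import json


def query_model(image_path: str, prompt: str) -> str:
    """Stand-in for a call to the MLLM under evaluation.

    Always answers "A" here so the script runs end to end; replace this with
    a real model call that returns the letter of the chosen option.
    """
    return "A"


def evaluate(path: str) -> float:
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # one entry per question

    correct = 0
    for item in items:
        options = item["options"]          # candidate answers for this question
        letters = "ABCD"[: len(options)]
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
            + "\nAnswer with the letter of the correct option."
        )
        prediction = query_model(item["image"], prompt).strip().upper()[:1]
        if prediction == item["answer"]:   # gold answer stored as a letter
            correct += 1

    return correct / len(items)


if __name__ == "__main__":
    print(f"Accuracy: {evaluate('cvqa_sample.json'):.3f}")
```

A loop like this reports simple accuracy over the multiple-choice items; per-language or per-country breakdowns would follow the same pattern, grouping items by the corresponding metadata fields before averaging.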
Why it matters?
This research is important because it promotes greater cultural awareness and linguistic diversity in AI systems. By developing a benchmark like CVQA, researchers can evaluate how well AI models perform in understanding different cultural contexts, which can lead to improvements in AI applications such as education, customer service, and content creation. Ultimately, this work aims to reduce biases in AI and make technology more accessible and relevant to people from all backgrounds.
Abstract
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason about knowledge present in both visual and textual data. However, most current VQA models use datasets that focus primarily on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered by VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or other approaches, they usually keep the images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models, and we hope it encourages more research efforts toward increasing cultural awareness and linguistic diversity in this field.