MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching

Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš

2025-02-20

Summary

This paper introduces MVL-SIB, a new way to test how well AI models can understand and match topics across different languages and types of information (like text and images). It's like creating a giant, multilingual picture-word association game for computers.

What's the problem?

Current tests for AI models that work with both language and images mostly focus on popular languages, leaving out many less common ones. This means we don't really know how well these AI models perform when dealing with languages that aren't widely spoken or studied. It's like only testing a translator's skills in common languages and not knowing how they'd do with rare ones.

What's the solution?

The researchers created MVL-SIB, a huge test that covers 205 different languages, over 100 more than any other similar test. The test checks how well AI models can match topics between text and images, and also how they handle the same matching task with text alone. They then ran a range of open models, along with GPT-4o and GPT-4o-mini, through the benchmark to see how they performed across all these languages.
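
To make the task format concrete, here is a minimal sketch of what one cross-modal topical matching question might look like. This is an illustrative assumption, not the paper's actual code: the MatchingItem structure, the model.choose call, and the accuracy helper are all hypothetical stand-ins for however a given model is prompted and scored.

```python
# Hypothetical sketch of a cross-modal topical matching item:
# the model sees image(s) for one topic and must pick the sentence
# (in the target language) that belongs to the same topic.
from dataclasses import dataclass

@dataclass
class MatchingItem:
    topic: str                       # gold topic label, hidden from the model
    image_paths: list[str]           # one or more images depicting the topic
    candidate_sentences: list[str]   # candidate sentences in the target language
    answer_index: int                # index of the sentence matching the topic

def accuracy(model, items: list[MatchingItem]) -> float:
    """Fraction of items where the model picks the correct sentence.

    `model.choose` is a stand-in for however a particular LVLM is
    prompted to return the index of its chosen candidate.
    """
    correct = 0
    for item in items:
        predicted = model.choose(item.image_paths, item.candidate_sentences)
        correct += int(predicted == item.answer_index)
    return correct / len(items)
```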

Why it matters?

This matters because as AI becomes more global, we need to make sure it works well for everyone, not just speakers of major languages. The study showed that current AI models struggle with less common languages and don't use multiple images effectively. By identifying these weaknesses, we can work on making AI that's truly inclusive and capable of understanding diverse languages and cultures. This could lead to better translation tools, more accessible technology, and AI assistants that can help people regardless of what language they speak.

Abstract

Existing multilingual vision-language (VL) benchmarks often only cover a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages -- over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.
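
As a rough illustration of the correlation analysis the abstract mentions, per-language accuracies from MVL-SIB and another benchmark could be compared as below. The language codes and all scores are invented placeholders, not results from the paper.

```python
# Hypothetical sketch: correlating per-language accuracy on MVL-SIB
# with accuracy on another multilingual VL benchmark. The numbers
# below are made-up placeholders, not figures from the paper.
from scipy.stats import spearmanr

mvl_sib_acc = {"eng": 0.91, "deu": 0.88, "swa": 0.62, "nqo": 0.25}
other_bench_acc = {"eng": 0.87, "deu": 0.84, "swa": 0.58, "nqo": 0.31}

langs = sorted(mvl_sib_acc)  # align the two score lists by language code
rho, p_value = spearmanr(
    [mvl_sib_acc[l] for l in langs],
    [other_bench_acc[l] for l in langs],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A high rank correlation across many languages would suggest the two benchmarks probe overlapping abilities, which is the kind of evidence the authors use to argue that MVL-SIB serves as a comprehensive probe of multilingual VL understanding.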