
MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan, Yunshi Lan, Botian Shi

2025-08-11

Summary

This paper introduces MELLA, a new dataset designed to help Multimodal Large Language Models (MLLMs) perform better in languages that don’t have much data available. MELLA improves models in two ways at once: understanding the language itself and being aware of cultural details, by using real native web descriptions alongside captions generated by AI.

What's the problem?

The problem is that while MLLMs work really well for popular languages with lots of data, they perform much worse in low-resource languages because there isn’t enough training material. Most current methods rely only on text translated from high-resource languages, which misses important cultural and multimodal information and leads to weak understanding and less detailed descriptions in these languages.

What's the solution?

To solve this, the researchers created MELLA, which brings together two kinds of data: native web alt-text that captures cultural context, and AI-generated captions that help with the language itself. By combining these two, they train models to be both more linguistically capable and more culturally grounded. The result is that these models produce richer, more detailed descriptions in multiple low-resource languages.
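The idea of merging the two data sources can be illustrated with a minimal sketch. This is not the authors' actual pipeline; the field names and structure here are assumptions made purely for illustration:

```python
# Hypothetical sketch of MELLA-style dual-source data mixing.
# Field names ("image", "alt_text", "caption") are assumptions, not the paper's schema.

def build_training_set(alt_text_samples, generated_captions):
    """Combine culturally grounded native alt-text with fluent AI-generated captions.

    Each source is tagged so downstream training can weight or filter by origin:
    native web alt-text carries cultural grounding, while MLLM captions carry
    richer, more fluent language.
    """
    dataset = []
    for sample in alt_text_samples:
        dataset.append({
            "image": sample["image"],
            "text": sample["alt_text"],
            "source": "native_web",   # cultural groundedness
        })
    for sample in generated_captions:
        dataset.append({
            "image": sample["image"],
            "text": sample["caption"],
            "source": "mllm",         # linguistic capability
        })
    return dataset
```

In this sketch, keeping a `source` tag on each example lets a fine-tuning recipe balance the two signals rather than letting one dominate.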

Why it matters?

This matters because it helps AI better understand and communicate in languages that are usually left behind due to limited data. By improving both language skills and cultural awareness, MELLA makes AI more useful and fair across different communities, helping more people access technology in their own language and cultural context.

Abstract

MELLA, a multimodal, multilingual dataset, enhances MLLMs in low-resource languages by improving linguistic capability and cultural groundedness through native web alt-text and MLLM-generated captions.