MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment

Omid Ghahroodi, Arshia Hemmat, Marzia Nouri, Seyed Mohammad Hadi Hosseini, Doratossadat Dastgheib, Mohammad Vali Sanian, Alireza Sahebi, Reihaneh Zohrabi, Mohammad Hossein Rohban, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah

2025-08-26

Summary

This paper introduces a new dataset called MEENA, designed to test how well vision-language models (VLMs) work with the Persian language, not just English.

What's the problem?

Most recent improvements in VLMs, AI systems that can understand both images and text, have focused on English. As a result, these models are far weaker in other languages such as Persian, and there was no good way to specifically test and improve their Persian skills. In short, the field lacked a benchmark for evaluating how well these models understand visual information *in* Persian.

What's the solution?

The researchers created MEENA, a dataset containing roughly 7,500 questions in Persian and 3,000 in English. The questions require reasoning, mathematics and physics knowledge, and the ability to interpret charts, diagrams, and even Persian art and literature. Each question also comes with metadata, including a difficulty level and a descriptive answer. The researchers then evaluated a range of models on this dataset, checking not only overall performance but also whether the models actually attended to the relevant parts of images and whether they hallucinated (made up information).
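To make the dataset's structure concrete, here is a minimal sketch of what a benchmark item and an accuracy-scoring loop might look like. The paper does not publish a loader API, so all field names (`question`, `choices`, `answer_index`, `language`, `difficulty`, `has_image`) and the `MeenaItem` class are illustrative assumptions, not the authors' actual schema:

```python
# Hypothetical sketch of a MEENA-style benchmark item; the field names
# below are assumptions for illustration, not the paper's real schema.
from dataclasses import dataclass


@dataclass
class MeenaItem:
    question: str      # exam question text (Persian or English)
    choices: list      # multiple-choice options
    answer_index: int  # index of the correct choice
    language: str      # "fa" (Persian) or "en" (English)
    difficulty: str    # e.g. "primary" or "upper-secondary"
    has_image: bool    # whether the item includes a chart or diagram


def accuracy(items, predict):
    """Fraction of items where predict(item) returns the correct choice index."""
    correct = sum(1 for it in items if predict(it) == it.answer_index)
    return correct / len(items)


# Toy usage: score a trivial "always pick option 0" baseline on two items.
items = [
    MeenaItem("2 + 2 = ?", ["4", "5"], 0, "en", "primary", False),
    MeenaItem("?", ["a", "b"], 1, "fa", "upper-secondary", True),
]
print(accuracy(items, lambda it: 0))  # 0.5
```

Because each item carries `language` and `difficulty` metadata, the same loop can be run on filtered subsets (e.g. only Persian items, or only upper-secondary questions) to get the cross-linguistic and per-level breakdowns the paper reports.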

Why it matters?

This work is important because it helps push the development of VLMs beyond just English. By providing a benchmark for Persian, it encourages researchers to build models that can understand and process information in more languages, making these AI tools more accessible and useful to a wider range of people and cultures.

Abstract

Recent advancements in large vision-language models (VLMs) have primarily focused on English, with limited attention given to other languages. To address this gap, we introduce MEENA (also known as PersianMMMU), the first dataset designed to evaluate Persian VLMs across scientific, reasoning, and human-level understanding tasks. Our dataset comprises approximately 7,500 Persian and 3,000 English questions, covering a wide range of topics such as reasoning, mathematics, physics, diagrams, charts, and Persian art and literature. Key features of MEENA include: (1) diverse subject coverage spanning various educational levels, from primary to upper secondary school, (2) rich metadata, including difficulty levels and descriptive answers, (3) original Persian data that preserves cultural nuances, (4) a bilingual structure to assess cross-linguistic performance, and (5) a series of diverse experiments assessing various capabilities, including overall performance, the model's ability to attend to images, and its tendency to generate hallucinations. We hope this benchmark contributes to enhancing VLM capabilities beyond English.