
Multimodal Evaluation of Russian-language Architectures

Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova

2025-11-27


Summary

This paper introduces a new way to test how well artificial intelligence models can understand and work with different types of information – text, images, audio, and video – specifically for the Russian language.

What's the problem?

Currently, there are no good standardized tests for evaluating how well AI models handle multiple types of data, especially for languages like Russian. This makes it hard to know how 'smart' these models really are, where they struggle, and what risks they might pose. Existing tests are often designed for English and don't account for the unique aspects of other languages and cultures.

What's the solution?

The researchers created 'Mera Multi,' a complete testing framework for AI models that work with Russian. They built 18 brand-new tests covering different combinations of text, images, audio, and video, with consistent instructions and scoring across tasks. They also took steps to prevent cheating or 'leakage' of the test data, such as watermarking and licensing the private sets. Finally, they evaluated several existing AI models, both open-source and commercial, to establish a baseline understanding of their performance.

Why it matters?

This work is important because it provides a way to reliably measure the abilities of AI models in Russian, which is crucial for developing and deploying AI technology that works well for Russian speakers. Furthermore, the method used to create this benchmark can be applied to other languages, especially related Slavic languages, helping to ensure AI is developed responsibly and effectively across different cultures and linguistic backgrounds.

Abstract

Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.