MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
Amir Hossein Kargaran, Ali Modarressi, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
2024-10-10

Summary
This paper introduces MEXA, a new method for evaluating how well English-centric large language models (LLMs) can understand and work with multiple languages.
What's the problem?
While many LLMs are trained primarily on English data, their ability to perform well in other languages is not well understood. Most existing evaluations focus on a handful of languages or on specific tasks, leaving a gap in our understanding of how these models handle multilingual use overall.
What's the solution?
MEXA addresses this issue by using parallel sentences, i.e., the same sentences translated into multiple languages, to assess the multilingual capabilities of LLMs. The method measures how closely a model's internal representations of non-English sentences align with those of their English counterparts, as illustrated in the sketch below. By analyzing several models and parallel datasets, MEXA identifies the strengths and weaknesses of these models across languages. The results show that MEXA reliably estimates LLM performance across many languages, correlating strongly with scores on established downstream tasks.
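At its core, MEXA assigns each non-English language a score reflecting how well its sentence representations line up with those of the English versions of the same sentences. The snippet below is a minimal sketch of one way to compute such a score, assuming cosine similarity over per-layer sentence embeddings and a retrieval-style criterion; the helper names `alignment_score` and `mexa_style_score`, as well as the max-over-layers aggregation, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def alignment_score(en_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Retrieval-style alignment between two sets of sentence embeddings.

    en_embs, tgt_embs: arrays of shape (n_sentences, hidden_dim), where row i
    of both arrays corresponds to the same parallel sentence.
    Returns the fraction of target-language sentences whose most similar
    English sentence (by cosine similarity) is their true translation.
    """
    en = en_embs / np.linalg.norm(en_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = tgt @ en.T                                   # (n, n) cosine similarities
    correct = sims.argmax(axis=1) == np.arange(len(sims))
    return float(correct.mean())

def mexa_style_score(en_layers, tgt_layers) -> float:
    """Aggregate per-layer alignment scores; here simply the maximum over layers."""
    return max(alignment_score(e, t) for e, t in zip(en_layers, tgt_layers))
```

With embeddings for, say, 100 FLORES-200 sentence pairs, a score near 1.0 means non-English sentences are embedded close to their English translations, while a score near chance level (1/n) indicates little cross-lingual alignment.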
Why it matters?
This research is important because it helps improve our understanding of how well AI models can work with languages other than English. By providing a reliable evaluation method, MEXA can guide future developments in multilingual AI, making it possible for these systems to better serve diverse populations and applications worldwide.
Abstract
English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.
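The abstract notes that the authors explore different ways to compute sentence embeddings in decoder-only models. Below is a minimal sketch of extracting per-layer sentence embeddings with Hugging Face transformers, assuming either mean pooling over non-padding tokens or last-token pooling; the checkpoint name and the `sentence_embeddings` helper are illustrative assumptions, not taken from the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint only; any decoder-only LLM with hidden states exposed works.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def sentence_embeddings(sentences, pooling="mean"):
    """Return one embedding matrix per layer for a batch of sentences.

    pooling="mean": average hidden states over non-padding tokens.
    pooling="last": take the hidden state of the final non-padding token
                    (assumes right padding).
    Output: list over layers, each a tensor of shape (n_sentences, hidden_dim).
    """
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    outputs = model(**batch)
    mask = batch["attention_mask"].unsqueeze(-1)        # (n, seq, 1)
    per_layer = []
    for h in outputs.hidden_states:                     # embedding layer + each block
        if pooling == "mean":
            emb = (h * mask).sum(1) / mask.sum(1)
        else:
            last = batch["attention_mask"].sum(1) - 1   # index of last real token
            emb = h[torch.arange(h.size(0)), last]
        per_layer.append(emb)
    return per_layer
```

These per-layer embeddings for English and non-English sides of a parallel corpus are exactly the inputs that an alignment score like the one sketched earlier would consume.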