
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Hengyu Luo, Zihao Li, Joseph Attieh, Sawal Devkota, Ona de Gibert, Shaoxiong Ji, Peiqin Lin, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Raúl Vázquez, Mengjie Wang, Samea Yusofi, Jörg Tiedemann

2025-04-08

Summary

This paper introduces GlotEval, a test suite that checks how well AI language models work across many languages, especially less common ones, such as testing whether a chatbot can understand and respond correctly in Swahili or Navajo.

What's the problem?

Most current AI testing focuses on English and a handful of popular languages and leaves out thousands of others, so it is hard to know whether AI tools actually work well for people who speak less common languages.

What's the solution?

GlotEval provides standardized benchmarks covering over 1,500 languages, uses prompt templates written in each language rather than only in English, and tests translation directly between non-English language pairs instead of always routing through English, which helps pinpoint exactly where a model struggles.
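To make those two ideas concrete, here is a minimal Python sketch. It is not GlotEval's actual API: the PROMPT_TEMPLATES dictionary, build_prompt, evaluate_direction, and the model_translate callable are all invented for illustration, and the template wordings are only examples. The chrF scoring uses the real sacrebleu library.

```python
# Illustrative sketch only -- not GlotEval's real interface.
# It shows (1) prompts written in the benchmark's own language instead
# of English, and (2) scoring a translation direction directly between
# two non-English languages (e.g. Swahili -> Finnish), with no English pivot.
import sacrebleu  # real library; corpus_chrf is its actual API

# (1) Language-specific prompt templates, keyed by ISO 639-3 code.
#     The template contents here are illustrative, not from the paper.
PROMPT_TEMPLATES = {
    "eng": "Translate the following text into {target}:\n{text}",
    "swa": "Tafsiri maandishi yafuatayo kwa {target}:\n{text}",
    "fin": "Käännä seuraava teksti kielelle {target}:\n{text}",
}

def build_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    """Build the instruction in the source language, not in English."""
    return PROMPT_TEMPLATES[src_lang].format(target=tgt_lang, text=text)

def evaluate_direction(model_translate, src_lang, tgt_lang, pairs):
    """Score one translation direction directly (no English pivot).

    `model_translate` is a hypothetical stand-in for whatever LLM call
    the harness wraps; `pairs` is a list of (source, reference) strings.
    """
    hypotheses = [
        model_translate(build_prompt(src_lang, tgt_lang, src))
        for src, _ in pairs
    ]
    references = [ref for _, ref in pairs]
    # chrF is a character-level MT metric often preferred for
    # morphologically rich and low-resource languages.
    return sacrebleu.corpus_chrf(hypotheses, [references]).score
```

Scoring each direction separately, rather than pivoting through English, is what lets an evaluation surface directions where a model quietly fails.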

Why does it matter?

This helps make AI tools fairer and more useful worldwide, ensuring that apps like translators and voice assistants work well for everyone, not just English speakers or speakers of other widely used languages.

Abstract

Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. Supporting seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation), spanning dozens to hundreds of languages, GlotEval highlights consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval's applicability for multilingual and language-specific evaluations.
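As a rough illustration of what "consistent multilingual benchmarking" can look like in practice, here is a hypothetical driver loop. The task list comes from the abstract, but the language sample, the benchmark function signature, and the result layout are all invented for illustration and are not GlotEval's interface.

```python
# Hypothetical benchmarking driver -- illustrates running the same set
# of tasks uniformly across many languages and collecting per-language
# scores, so weak spots are easy to compare side by side.
from collections import defaultdict
from typing import Callable, Dict

# The seven task types named in the abstract.
TASKS = [
    "machine_translation", "text_classification", "summarization",
    "open_ended_generation", "reading_comprehension",
    "sequence_labeling", "intrinsic_evaluation",
]

# Small illustrative sample of ISO 639-3 codes; a real run would
# cover dozens to hundreds of languages.
LANGUAGES = ["swa", "fin", "nav", "amh", "quy"]

def benchmark(run_task: Callable[[str, str], float]) -> Dict[str, Dict[str, float]]:
    """Run every task in every language through one uniform harness.

    `run_task(task, lang)` is a caller-supplied function that loads the
    benchmark for that task/language pair, prompts the model, and
    returns a score -- a stand-in for whatever the framework does.
    """
    scores: Dict[str, Dict[str, float]] = defaultdict(dict)
    for task in TASKS:
        for lang in LANGUAGES:
            scores[task][lang] = run_task(task, lang)
    return scores  # e.g. scores["machine_translation"]["swa"] -> 41.2
```

Keeping one harness and one score table across all task/language pairs is what makes the per-language comparisons meaningful, rather than each language being evaluated under slightly different conditions.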