BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan
2025-02-13
Summary
This paper introduces BenchMAX, a new way to test how well AI language models handle complex tasks across many different languages. It's like creating a super-advanced, multilingual SAT test for AI.
What's the problem?
Current tests for AI language models mostly focus on simple tasks and don't do a good job of checking how well these models can handle tricky things like following instructions, reasoning, or coding in many different languages. This makes it hard to know if an AI is truly good at working with multiple languages or just really good at English.
What's the solution?
The researchers created BenchMAX, which tests AI models on advanced tasks in 17 different languages. They machine-translated the English test data into 16 other languages and had three native speakers independently check each translated sample. BenchMAX includes tasks that test how well AI can follow instructions, reason through problems, understand long texts, and write code. They also made sure the evaluation is fair and comparable across all languages, as illustrated in the sketch below.
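As a rough illustration (not the official BenchMAX code), here is a minimal Python sketch of the kind of evaluation such a benchmark enables: run the same task in every language and compare each language's score against English. The language list, test items, model stub, and exact-match metric are hypothetical placeholders, not the paper's actual data or metrics.

```python
# Minimal sketch (not the official BenchMAX harness): score one model on the
# same task across several languages and report each language's gap vs. English.
# Language list, task data, and the scoring function are illustrative placeholders.
from statistics import mean

# Hypothetical per-language test items: (prompt, reference_answer) pairs.
TASKS = {
    "en": [("Translate 'cat' into French.", "chat")],
    "fr": [("Traduire 'cat' en français.", "chat")],
    "zh": [("把 'cat' 翻译成法语。", "chat")],
}

def dummy_model(prompt: str) -> str:
    """Stand-in for a real LLM call; always answers 'chat'."""
    return "chat"

def score(prediction: str, reference: str) -> float:
    """Toy exact-match metric; real benchmarks use task-specific metrics."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model, tasks):
    """Return the mean score per language for the given model."""
    return {lang: mean(score(model(p), ref) for p, ref in items)
            for lang, items in tasks.items()}

if __name__ == "__main__":
    per_lang = evaluate(dummy_model, TASKS)
    # Report each language's gap relative to English, the comparison BenchMAX highlights.
    for lang, acc in per_lang.items():
        print(f"{lang}: {acc:.2f} (gap vs. en: {per_lang['en'] - acc:+.2f})")
```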
Why it matters?
This matters because as AI becomes more common in our daily lives, we need to make sure it works well for everyone, not just English speakers. BenchMAX helps researchers see where AI models are strong or weak in different languages, which can guide them in making better, more inclusive AI systems. It also shows that just making AI models bigger doesn't automatically make them better at all languages, highlighting the need for more focused development in multilingual AI capabilities.
Abstract
Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.