
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra

2024-10-22

Summary

This paper presents the Cross-Lingual Auto Evaluation (CIA) Suite, a new evaluation framework designed to better assess multilingual large language models (LLMs) by addressing the challenges of evaluating text generated in languages other than English.

What's the problem?

Evaluating machine-generated text is hard, especially for languages other than English. Most current methods, whether automated metrics, human assessments, or LLM-based judges, focus mainly on English, which leaves a big gap in how we evaluate models that work in many languages.

What's the solution?

The researchers created the CIA Suite, which includes an evaluator model named Hercule and a new test set called Recon. Recon contains 500 human-annotated instructions covering a range of task capabilities, along with human judgment scores across six languages, which makes it possible both to benchmark multilingual models and to check how well evaluator models agree with humans. Hercule scores responses written in a target language against easily available English reference answers, so texts in other languages can be evaluated even when reference answers in those languages are scarce. A rough sketch of this idea follows below.
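To make the idea concrete, here is a minimal sketch of reference-based cross-lingual evaluation: an evaluator is shown an instruction and a response in the target language together with an English reference answer, and asked for a score. The prompt template and the build_eval_prompt helper are illustrative assumptions, not the CIA Suite's actual prompts or Hercule's training setup.

```python
# Illustrative sketch only: the template and helper below are hypothetical and
# may differ from the prompts used to train or query Hercule.

EVAL_PROMPT = """You are an impartial evaluator. Rate the candidate response on a 1-5 scale.

Instruction (in {language}):
{instruction}

Reference answer (in English):
{reference_en}

Candidate response (in {language}):
{response}

Return only an integer score from 1 to 5."""


def build_eval_prompt(instruction: str, reference_en: str, response: str, language: str) -> str:
    """Combine a target-language instruction and response with an English reference answer."""
    return EVAL_PROMPT.format(
        language=language,
        instruction=instruction,
        reference_en=reference_en,
        response=response,
    )


if __name__ == "__main__":
    prompt = build_eval_prompt(
        instruction="¿Qué es la fotosíntesis?",
        reference_en="Photosynthesis is the process by which plants convert light into chemical energy.",
        response="La fotosíntesis es el proceso por el cual las plantas convierten la luz en energía química.",
        language="Spanish",
    )
    print(prompt)  # In practice this prompt would be sent to an evaluator LLM such as Hercule.
```

The key point is that only the reference answer needs to exist in English; the instruction and the response stay in the target language, which is what makes the approach usable for low-resource languages.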

Why it matters?

This work matters because it improves how we assess language models that operate in multiple languages, helping ensure they are reliable and effective for users around the world, especially in low-resource language scenarios.

Abstract

Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
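As one way to see what "aligns more closely with human judgments" means in practice, meta-evaluation of an evaluator LLM typically comes down to correlating its scores with human scores on the same responses. The snippet below is a generic sketch of that step using Kendall's tau as the agreement measure; the paper may use different correlation statistics, and the score lists here are placeholders, not numbers from the paper.

```python
# Generic sketch of meta-evaluating an evaluator LLM: correlate its scores with
# human judgment scores for the same set of responses. The lists below are
# placeholder values, not results reported in the paper.
from scipy.stats import kendalltau

human_scores = [5, 3, 4, 2, 4, 1]      # placeholder human ratings (1-5 scale)
evaluator_scores = [5, 3, 3, 2, 4, 2]  # placeholder scores from an evaluator LLM

tau, p_value = kendalltau(human_scores, evaluator_scores)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```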