How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
2025-02-21
Summary
This paper examines how often large language models (LLMs) hallucinate, that is, produce incorrect or unsupported statements, across different languages when answering long, knowledge-intensive questions. It also introduces a way to detect and measure these errors at scale across many languages.
What's the problem?
LLMs often generate information that is incorrect or not grounded in facts, a failure known as hallucination. Most research on this problem is English-centric and focuses on narrow tasks like machine translation and summarization, yet these models are used around the world, in many languages, and for open-ended information seeking. This leaves a gap in our understanding of how reliable they are across languages and contexts.
What's the solution?
The researchers trained a multilingual hallucination detection model and conducted a large-scale study across 30 languages and six families of open-source LLMs. They started from an English hallucination detection dataset and used machine translation to produce (noisy) training data in other languages, and they manually annotated gold data for five high-resource languages to confirm that the noisy and gold test sets yield similar hallucination-rate estimates. They then built a new knowledge-intensive question-answering dataset, with Wikipedia articles as references, to estimate hallucination rates across all 30 languages. The study found that larger models tend to hallucinate less than smaller ones, and that although models write longer answers with more hallucinated tokens in higher-resource languages, a language's length-normalized hallucination rate is not correlated with how well-resourced it is.
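To make the length-normalization point concrete, here is a minimal Python sketch of how per-language hallucination rates could be aggregated from token-level detector outputs. The function name, input format, and numbers are illustrative assumptions, not the authors' code or data.

```python
from collections import defaultdict

def length_normalized_rates(responses):
    """Aggregate per-language hallucination rates.

    `responses` is an iterable of (language, hallucinated_token_count,
    total_token_count) tuples, e.g. produced by running a token-level
    hallucination detector over model answers. (Hypothetical format.)
    """
    hallucinated = defaultdict(int)
    total = defaultdict(int)
    for lang, n_halluc, n_tokens in responses:
        hallucinated[lang] += n_halluc
        total[lang] += n_tokens
    # Dividing by total response length separates "longer answers contain
    # more hallucinated tokens" from "a larger share of the answer is
    # hallucinated".
    return {lang: hallucinated[lang] / total[lang] for lang in total}

# Illustrative example: a high-resource language with longer answers can
# have the same normalized rate as a low-resource one with shorter answers.
example = [
    ("en", 40, 400),   # long answer, 10% of tokens hallucinated
    ("sw", 10, 100),   # short answer, 10% of tokens hallucinated
]
print(length_normalized_rates(example))  # {'en': 0.1, 'sw': 0.1}
```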
Why it matters?
This matters because it helps improve the reliability of LLMs in multilingual settings, making them more useful for people around the world. By identifying how often these models make mistakes in different languages and why, this research can guide the development of better AI systems that are accurate and trustworthy no matter what language they are used in.
Abstract
In the age of misinformation, hallucination -- the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses -- represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common "in the wild" than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit larger hallucination rates than larger models.
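As an illustration of the silver-versus-gold validation step described in the abstract, the following Python sketch compares hallucination-rate estimates computed on the two test-set variants. All language codes and numbers are placeholders for illustration, not results from the paper.

```python
# Minimal sketch of the silver-vs-gold sanity check, assuming per-language
# hallucination-rate estimates are already computed on both test sets.
from statistics import correlation  # Pearson r, available in Python 3.10+

# Placeholder values; the actual languages and rates come from the paper's
# manually annotated gold data and MT-derived silver data.
silver_rates = {"l1": 0.18, "l2": 0.12, "l3": 0.15, "l4": 0.20, "l5": 0.16}
gold_rates   = {"l1": 0.17, "l2": 0.13, "l3": 0.14, "l4": 0.21, "l5": 0.15}

langs = sorted(silver_rates)
r = correlation([silver_rates[l] for l in langs],
                [gold_rates[l] for l in langs])
mad = sum(abs(silver_rates[l] - gold_rates[l]) for l in langs) / len(langs)

# If the two estimates agree closely (high r, small mean absolute
# difference), silver test sets can stand in for gold annotation in
# languages where manual labels are unavailable.
print(f"Pearson r = {r:.2f}, mean abs. diff = {mad:.3f}")
```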