
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova

2025-10-17


Summary

This paper introduces PsiloQA, a new dataset designed to help researchers identify when large language models (LLMs) are 'hallucinating', that is, making up facts. Rather than judging a whole answer as right or wrong, it marks the exact spans of text that are fabricated, and it does so across 14 languages.

What's the problem?

Large language models are powerful, but they sometimes confidently state things that aren't true. Existing benchmarks for this 'hallucination' problem are limited in two ways: they usually only judge whether the *whole* response is correct or incorrect, and they mostly cover English. We need a way to pinpoint *exactly* where an LLM goes wrong, and we need to do it in many languages so these models can be trusted globally.

What's the solution?

The researchers created PsiloQA, a large dataset of questions and answers in 14 languages, using an automated three-stage pipeline. First, GPT-4o generates question-answer pairs from Wikipedia passages. Second, a diverse set of LLMs answer those questions *without* being shown the source passage, which makes them prone to hallucinate. Third, GPT-4o compares each model's answer against the golden answer and the retrieved context, and marks the specific spans of the response that were made up. The authors then benchmarked several detection methods, including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models, and found that the fine-tuned encoders performed best across languages.
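To make the third stage concrete, here is a toy word-level approximation of span annotation: it flags tokens in a model's answer that cannot be aligned to the golden answer. This is a sketch under loose assumptions; the real pipeline uses GPT-4o with retrieved Wikipedia context, not a string diff, and the example sentences are invented.

```python
import difflib

def flag_hallucinated_tokens(model_tokens, gold_tokens):
    """Toy stand-in for PsiloQA's annotation stage: align the model answer
    to the golden answer and flag tokens only the model produced.
    (The actual dataset is annotated by GPT-4o, not by a diff.)"""
    sm = difflib.SequenceMatcher(a=gold_tokens, b=model_tokens, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "insert"):  # present in the model answer only
            flagged.extend(model_tokens[j1:j2])
    return flagged

gold = "The Eiffel Tower was completed in 1889.".split()
model = "The Eiffel Tower was completed in 1912.".split()
print(flag_hallucinated_tokens(model, gold))  # ['1912.']
```

A real annotator also consults retrieved context, since a model answer can differ from the golden answer in wording without being wrong; the diff above would over-flag such paraphrases.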

Why it matters?

This work is important because it provides a better way to evaluate and improve the factual accuracy of LLMs, especially as they are used in more and more real-world applications. PsiloQA is a cost-effective and scalable way to test for hallucinations in many languages, and it can help researchers build more trustworthy and reliable AI systems. The dataset also shows that improvements made in one language can often be applied to others.

Abstract

Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
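The fine-tuned encoder models that the abstract reports as strongest typically treat span-level detection as token classification: an encoder labels every token of the response, and runs of positive labels are merged into hallucinated spans. The sketch below shows only that final decoding step over hypothetical B/I/O tags; the encoder itself (e.g. a fine-tuned multilingual Transformer) is assumed, not shown, and the example tokens are invented.

```python
def spans_from_bio(tokens, tags):
    """Merge per-token B/I/O predictions (one tag per token, as a
    fine-tuned encoder would emit) into hallucinated text spans."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":                # a new hallucinated span starts
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:  # continue the open span
            current.append(tok)
        else:                         # "O": close any open span
            if current:
                spans.append(" ".join(current))
                current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["It", "opened", "in", "May", "1912", "in", "Lyon", "."]
tags   = ["O",  "O",      "O",  "B",   "I",    "O",  "B",    "O"]
print(spans_from_bio(tokens, tags))  # ['May 1912', 'Lyon']
```

Because the encoder reads the full response (and any provided context) at once, the same decoding works unchanged across all 14 languages, which is one reason this family of methods transfers well cross-lingually.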