BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation
Bryan Li, Samar Haider, Fiona Luo, Adwait Agashe, Chris Callison-Burch
2024-10-03

Summary
This paper presents BordIRlines, a new dataset for evaluating how well language models retrieve and use information across different languages, especially in contested settings such as geopolitical disputes.
What's the problem?
Large language models (LLMs) are great at generating text but often struggle with hallucination and bias, especially when they need to pull information from multiple languages. This can lead to inconsistent or incorrect answers when the model encounters conflicting information from different sources or languages.
What's the solution?
The authors created the BordIRlines dataset, which pairs queries about geopolitical disputes with relevant Wikipedia passages drawn from multiple languages and perspectives. They investigated how providing additional context, and how varying the language and source of that context, affects a model's responses. The results showed that existing retrieval-augmented generation (RAG) systems have difficulty staying consistent when given competing information in multiple languages; a sketch of this kind of consistency probe appears below.
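To make the evaluation setup concrete, here is a minimal sketch of a cross-lingual consistency probe in the spirit the paper describes. This is an illustrative outline, not the authors' implementation (their code is at https://github.com/manestay/bordIRlines); `query_llm` is a hypothetical stand-in for a real model call, and the passages would come from the dataset's multilingual Wikipedia sources.

```python
from itertools import combinations

def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Compose a RAG prompt from (language, passage_text) pairs."""
    context = "\n\n".join(f"[{lang}] {text}" for lang, text in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: substitute a real LLM API call here.
    raise NotImplementedError

def consistency_probe(query: str, passages: list[tuple[str, str]]) -> dict:
    """Ask the same query under every two-passage context mix.

    Divergent answers across language mixes are the kind of
    inconsistency the paper measures.
    """
    answers = {}
    for pair in combinations(passages, 2):
        langs = tuple(sorted(lang for lang, _ in pair))
        answers[langs] = query_llm(build_prompt(query, list(pair)))
    return answers
```

If the answers change depending on which languages appear in the context, the system exhibits exactly the cross-lingual inconsistency that BordIRlines is designed to surface.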
Why it matters?
This research is important because it helps improve the reliability of AI systems that need to understand and generate text based on information from different languages. By studying how LLMs handle cross-lingual queries, researchers can develop better methods for ensuring that these models provide accurate and unbiased information, which is crucial for applications in global communication, diplomacy, and education.
Abstract
Large language models excel at creative generation but continue to struggle with the issues of hallucination and bias. While retrieval-augmented generation (RAG) provides a framework for grounding LLMs' responses in accurate and up-to-date information, it still raises the question of bias: which sources should be selected for inclusion in the context? And how should their importance be weighted? In this paper, we study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems at answering queries about geopolitical disputes, which exist at the intersection of linguistic, cultural, and political boundaries. Our dataset is sourced from Wikipedia pages containing information relevant to the given queries and we investigate the impact of including additional context, as well as the composition of this context in terms of language and source, on an LLM's response. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages. We present case studies to illustrate these issues and outline steps for future research to address these challenges. We make our dataset and code publicly available at https://github.com/manestay/bordIRlines.