OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
2024-12-04

Summary
This paper introduces OHRBench, a new benchmark designed to evaluate how Optical Character Recognition (OCR) affects Retrieval-Augmented Generation (RAG) systems, particularly how OCR errors degrade the accuracy of the information these systems retrieve and generate.
What's the problem?
Retrieval-Augmented Generation (RAG) systems enhance large language models by using external knowledge to improve their responses. However, these systems often rely on OCR to extract information from unstructured documents like PDFs. Unfortunately, OCR can introduce errors, known as 'noise,' which can lead to inaccurate or misleading information being used by the RAG systems. This undermines the reliability of the outputs produced by these models.
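To make the failure mode concrete, here is a toy sketch (my own illustration, not from the paper) of how OCR character errors can hurt retrieval: a keyword-overlap score between a query and a passage drops sharply once OCR garbles the matching terms.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

query = "optical character recognition accuracy"
clean = "optical character recognition accuracy depends on image quality"
# The same passage with typical OCR confusions (o->0, e->c, m->rn):
noisy = "0ptical charactcr rec0gnition accuracv depends on irnage quality"

print(jaccard(query, clean))  # high overlap: the passage is retrievable
print(jaccard(query, noisy))  # overlap collapses: the passage may be missed
```

Real RAG systems use embedding-based retrievers rather than lexical overlap, but the underlying issue is the same: noise in the knowledge base weakens the match between queries and relevant passages.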
What's the solution?
To address this issue, the researchers created OHRBench, a benchmark that includes 350 carefully selected PDF documents from various real-world applications. They identified two main types of OCR noise: Semantic Noise (errors in meaning) and Formatting Noise (issues with how text is presented). By testing current OCR solutions against this benchmark, they found that none were effective enough for creating high-quality knowledge bases for RAG systems. They also explored the potential of using Vision-Language Models (VLMs) without OCR to improve performance.
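The two noise categories can be illustrated with a minimal sketch. Note this is only a toy approximation of my own; OHRBench's actual perturbation procedure is more sophisticated, and the confusion table and rates below are made up for illustration.

```python
import random

def add_semantic_noise(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt characters to mimic OCR misrecognition (errors in meaning)."""
    rng = random.Random(seed)
    confusions = {"l": "1", "O": "0", "m": "rn", "e": "c"}  # hypothetical confusion table
    return "".join(
        confusions[ch] if ch in confusions and rng.random() < rate else ch
        for ch in text
    )

def add_formatting_noise(text: str, rate: float, seed: int = 0) -> str:
    """Inject stray markup to mimic errors in how text is presented."""
    rng = random.Random(seed)
    words = [f"**{w}**" if rng.random() < rate else w for w in text.split()]
    return " ".join(words)

sample = "The model uses OCR to extract text."
print(add_semantic_noise(sample, rate=0.5))
print(add_formatting_noise(sample, rate=0.5))
```

Varying `rate` yields documents with controllable degrees of each noise type, mirroring the benchmark's idea of testing RAG robustness across a spectrum of corruption levels.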
Why it matters?
This research is important because it highlights the challenges posed by OCR in ensuring the accuracy of information used by AI systems. By providing a comprehensive evaluation framework with OHRBench, it aims to improve future OCR technologies and enhance the reliability of RAG systems. This could lead to better AI applications in fields like education, law, and finance, where accurate information retrieval is crucial.
Abstract
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench