OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation (2412.02592v2)

Published 3 Dec 2024 in cs.CV

Abstract: Retrieval-augmented Generation (RAG) enhances LLMs by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 8,561 carefully selected unstructured document images from seven real-world RAG application domains, along with 8,498 Q&A pairs derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise: Semantic Noise and Formatting Noise and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the trend relationship between the degree of OCR noise and RAG performance. Our OHRBench, including PDF documents, Q&As, and the ground truth structured data are released at: https://github.com/opendatalab/OHR-Bench

Summary

  • The paper introduces OHRBench to assess the cascading impact of OCR noise on building reliable knowledge bases for RAG systems.
  • It distinguishes between Semantic and Formatting Noise, showing at least a 7.5% performance gap in retrieval precision and generation accuracy.
  • The study suggests fine-tuning RAG frameworks and exploring Vision-Language Models to counter OCR-induced errors in complex, multimodal documents.

Impact of OCR on Retrieval-Augmented Generation Systems

This paper provides a comprehensive evaluation of how Optical Character Recognition (OCR) affects Retrieval-Augmented Generation (RAG) systems. The primary focus is the negative impact of OCR noise on the construction and use of external knowledge bases, which are central to how RAG enhances LLMs with external data.

First, the paper introduces OHRBench, a benchmark designed to assess the cascading influence of OCR errors on RAG systems. The benchmark comprises 8,561 unstructured document images drawn from seven real-world RAG application domains, including Textbook, Law, Finance, Newspaper, Manual, and Academia, together with 8,498 question-answer pairs derived from multimodal document elements that pose significant challenges to current OCR solutions. The inclusion of multimodal elements such as text, tables, and formulas underlines the difficulty of constructing knowledge bases free from OCR-induced imperfections.

Through OHRBench, the authors identify two main types of OCR noise: Semantic Noise and Formatting Noise. Semantic Noise arises from OCR prediction errors that alter the meaning of the extracted text, while Formatting Noise stems from the non-uniform representation of structured elements such as tables and formulas. The two noise types were found to affect RAG systems differently, with Semantic Noise generally having the more profound effect on both the retrieval and generation stages.
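To make the distinction concrete, the sketch below shows the two perturbation styles in miniature. The character-confusion table, the markup rewrites, and the example row are illustrative assumptions, not the benchmark's actual perturbation rules.

```python
import random
import re

random.seed(0)

# Illustrative OCR confusion pairs; the benchmark's perturbation rules are more elaborate.
CONFUSIONS = {"0": "O", "1": "l", "5": "S", "m": "rn", "e": "c"}

def semantic_noise(text: str, rate: float = 0.3) -> str:
    """Corrupt characters to mimic OCR misrecognition (the content itself changes)."""
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and random.random() < rate else ch
        for ch in text
    )

def formatting_noise(text: str) -> str:
    """Rewrite markup while leaving the content intact (only the representation changes)."""
    text = re.sub(r"\s*\|\s*", "  ", text)   # flatten Markdown table delimiters
    text = text.replace("$", "")             # drop inline-math delimiters around formulas
    return text

ground_truth = "| Year | Revenue |\n| 2015 | 1,205 |\nGrowth: $r = 0.12$"
print(semantic_noise(ground_truth))    # e.g. digits and letters may be misread
print(formatting_noise(ground_truth))  # table and formula markup is lost, text preserved
```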

In terms of numerical results, the evaluation of existing OCR methods revealed a concerning gap. Even the leading OCR solutions could not produce knowledge bases of the quality needed for effective RAG applications: on OHRBench, the best OCR-generated structured data trailed the ground-truth data by a minimum performance gap of 7.5%. This drop appeared in both retrieval precision and generation accuracy, pointing to a substantial vulnerability of current RAG systems to OCR-induced noise.
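The toy sketch below illustrates the cascading-evaluation idea behind that gap: the same questions are answered against a knowledge base parsed from ground truth and one parsed by noisy OCR, and the relative drop in a retrieval metric is reported. The bag-of-words retriever, the example chunks, and the corrupted strings are assumptions for illustration; the benchmark uses real retrievers, LLMs, and its own data.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words term counts (stand-in for a real dense or BM25 retriever)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top1_accuracy(chunks: list[str], questions: list[str], gold_ids: list[int]) -> float:
    """Fraction of questions whose highest-scoring chunk is the gold chunk."""
    hits = 0
    for question, gold in zip(questions, gold_ids):
        scores = [cosine(bow(question), bow(chunk)) for chunk in chunks]
        hits += scores.index(max(scores)) == gold
    return hits / len(questions)

# The same two pages, parsed from ground truth vs. a noisy OCR run (toy data).
gt_chunks  = ["net revenue grew 12 percent in fiscal 2015",
              "table 3 net revenue by region in 2015"]
ocr_chunks = ["net rcvenue qrew 12 percent ln fiscal 2O15",   # semantic noise
              "table 3 net revenue by region in 2015"]
questions  = ["what was net revenue growth in fiscal 2015",
              "which table shows net revenue by region"]
gold_ids   = [0, 1]

acc_gt  = top1_accuracy(gt_chunks, questions, gold_ids)
acc_ocr = top1_accuracy(ocr_chunks, questions, gold_ids)
print(f"ground truth: {acc_gt:.2f}, OCR: {acc_ocr:.2f}, "
      f"relative drop: {100 * (acc_gt - acc_ocr) / acc_gt:.0f}%")
```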

Explorations using OHRBench revealed that Semantic Noise consistently impacts retrieval modules and LLMs more than Formatting Noise, aligning with existing studies emphasizing noise sensitivity in RAG pipelines. Semantic Noise affects retrieval accuracy by introducing irrelevant or incorrect content, while Formatting Noise impacts structural consistency in document interpretation, particularly in tables and formulas. The varied effects highlight the necessity for RAG systems to be fine-tuned with OCR-specific solutions, prioritizing robustness against these identified noise patterns.

The paper also explores the potential of Vision-Language Models (VLMs) as alternatives to OCR in certain RAG contexts. Although initial findings indicate that VLMs still lag behind traditional OCR-based pipelines in accuracy, the paper presents scenarios where combining the page images with OCR-parsed text as VLM input shrinks the performance gap significantly, suggesting a promising direction for future research.
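As a sketch of that direction, the snippet below packs both the OCR-parsed text of retrieved chunks and the original page images into a single multimodal request. `RetrievedPage`, `build_multimodal_prompt`, and the `call_vlm` callable are hypothetical names standing in for whatever retriever and VLM client a system actually uses; this is not an interface from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RetrievedPage:
    text: str        # OCR-parsed text of the retrieved chunk (possibly noisy)
    image_path: str  # path to the original page image

def build_multimodal_prompt(question: str, pages: list[RetrievedPage]) -> dict:
    """Pack parsed text and source images so the VLM can recover OCR-corrupted content."""
    context = "\n\n".join(p.text for p in pages)
    return {
        "prompt": ("Answer using the context and the attached page images.\n\n"
                   f"Context:\n{context}\n\nQuestion: {question}"),
        "images": [p.image_path for p in pages],
    }

def answer(question: str, pages: list[RetrievedPage],
           call_vlm: Callable[[dict], str]) -> str:
    # `call_vlm` is a placeholder for a multimodal model client, not a specific library API.
    return call_vlm(build_multimodal_prompt(question, pages))

if __name__ == "__main__":
    pages = [RetrievedPage(text="Revenue grew 12% in 2015 (table 3).",
                           image_path="report_p12.png")]
    fake_vlm = lambda payload: f"(stub) would send {len(payload['images'])} image(s) to a VLM"
    print(answer("How much did revenue grow in 2015?", pages, fake_vlm))
```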

Practically, the paper implies that as more RAG systems are deployed in real-world applications, the constraints posed by OCR may become more evident. Improving OCR methods to better handle complex layouts and multimodal document processing may mitigate some identified issues. Theoretically, these findings also reinforce the necessity of continued exploration into error-resilient RAG frameworks and the development of benchmarks like OHRBench to pioneer these advancements.

In conclusion, this paper comprehensively documents the detrimental effects of OCR noise on RAG systems, offering insight into where optimization effort is best spent and highlighting alternative approaches that could move beyond OCR's limitations. Future advances in RAG will likely depend heavily on overcoming the challenges identified in this assessment.