VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents (2504.09795v1)

Published 14 Apr 2025 in cs.CL, cs.AI, cs.CV, and cs.IR

Abstract: We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.

Authors (6)
  1. Ryota Tanaka (11 papers)
  2. Taichi Iki (4 papers)
  3. Taku Hasegawa (4 papers)
  4. Kyosuke Nishida (23 papers)
  5. Kuniko Saito (8 papers)
  6. Jun Suzuki (86 papers)

Summary

An Overview of VDocRAG: Retrieval-Augmented Generation for Visually-Rich Documents

The paper presents "VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents," a novel framework designed to enhance question answering capabilities over documents that combine text with complex visual and structural elements such as charts, tables, and diagrams. VDocRAG introduces a retrieval-augmented generation (RAG) model that processes documents in their native visual format rather than converting them to text, which often leads to information loss.

Framework Overview

VDocRAG consists of two main components: VDocRetriever and VDocGenerator. VDocRetriever employs a large vision-language model (LVLM) in a dual-encoder setting to retrieve document images relevant to the question from a corpus of document images. VDocGenerator then generates answers conditioned on the retrieved images. This approach exploits both the textual and visual cues embedded within documents, capturing details that purely text-based systems often miss.
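To make the retrieve-then-generate flow concrete, the sketch below wires the two components together. The encoder and generator functions are placeholders (random vectors and canned text), and names such as encode_query, encode_page_image, generate_answer, and the embedding size are illustrative assumptions, not the paper's API.

```python
# Minimal sketch of the retrieve-then-generate flow described above.
# Placeholder encoders stand in for the shared LVLM used as a dual encoder
# (VDocRetriever) and as an answer generator (VDocGenerator).
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 768  # assumed embedding size

def encode_query(question: str) -> np.ndarray:
    """Placeholder for the query encoder."""
    return rng.standard_normal(EMB_DIM)

def encode_page_image(image_path: str) -> np.ndarray:
    """Placeholder for the document-image encoder."""
    return rng.standard_normal(EMB_DIM)

def generate_answer(question: str, retrieved_images: list[str]) -> str:
    """Placeholder for the generator conditioned on retrieved page images."""
    return f"answer derived from {len(retrieved_images)} retrieved pages"

def retrieve_top_k(question: str, corpus: list[str], k: int = 5) -> list[str]:
    q = encode_query(question)
    doc_embs = np.stack([encode_page_image(p) for p in corpus])
    # Cosine similarity between the query and each document-image embedding.
    sims = doc_embs @ q / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [corpus[i] for i in top]

corpus = [f"page_{i}.png" for i in range(100)]   # document pages stored as images
question = "What was the reported revenue growth in the chart on page 3?"
pages = retrieve_top_k(question, corpus, k=5)
print(generate_answer(question, pages))
```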

To address the inherent challenges of understanding non-textual data, VDocRAG leverages high-resolution image encoding. The model dynamically crops each page image into smaller patches while preserving its aspect ratio; these patches are then processed by an image encoder, and the encoded visual features are aligned with their textual counterparts through specialized pre-training tasks.
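As a rough illustration of aspect-ratio-preserving cropping, the snippet below picks a tile grid whose shape best matches the page's aspect ratio and slices the resized image into fixed-size crops. The tile size and tile budget are assumptions, and the exact cropping scheme in VDocRAG may differ.

```python
# Sketch of aspect-ratio-aware dynamic cropping: choose a grid of fixed-size
# tiles whose overall aspect ratio best matches the input page, then resize
# and slice. Illustrative only; not necessarily the paper's exact scheme.
from PIL import Image

TILE = 336       # assumed per-crop resolution expected by the image encoder
MAX_TILES = 9    # assumed budget on the number of crops per page

def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> tuple[int, int]:
    """Pick (cols, rows) whose aspect ratio is closest to the image's."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(cols / rows - target)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def dynamic_crop(img: Image.Image) -> list[Image.Image]:
    cols, rows = choose_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]

# Usage: crops = dynamic_crop(Image.open("page_3.png")); each crop is TILE x TILE.
```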

Pre-training Strategies

The paper introduces self-supervised pre-training tasks designed to adapt LVLMs to document retrieval. These tasks, Representation Compression via Retrieval (RCR) and Representation Compression via Generation (RCG), compress image representations into dense token representations and align them with the text contained in the documents, enhancing the model's ability to retrieve and generate information from visually complex documents. The strategy leverages both the understanding and generative capabilities of LVLMs to learn stronger retrieval representations.
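The sketch below illustrates the general idea of compressing a page's visual tokens into a dense vector and aligning it with the page's text via an in-batch contrastive loss. Mean pooling and the InfoNCE objective are stand-ins chosen for brevity; the paper's actual RCR/RCG formulations may differ.

```python
# Generic sketch of contrastive image-text alignment in the spirit of RCR:
# compress each page's visual tokens into one dense vector and pull it toward
# the embedding of the text extracted from that page (in-batch negatives).
import torch
import torch.nn.functional as F

def rcr_style_loss(image_token_states: torch.Tensor,
                   text_embeddings: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    """
    image_token_states: (batch, num_tokens, dim) LVLM hidden states over the
                        visual tokens of each page image.
    text_embeddings:    (batch, dim) embeddings of the corresponding page text.
    """
    # Mean pooling stands in for the paper's token-compression mechanism.
    image_vec = F.normalize(image_token_states.mean(dim=1), dim=-1)  # (batch, dim)
    text_vec = F.normalize(text_embeddings, dim=-1)

    # Symmetric in-batch contrastive (InfoNCE) loss.
    logits = image_vec @ text_vec.T / temperature                     # (batch, batch)
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Example with random tensors:
loss = rcr_style_loss(torch.randn(8, 256, 768), torch.randn(8, 768))
```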

OpenDocVQA: A Comprehensive Dataset

The authors also contribute OpenDocVQA, a unified collection of open-domain document VQA datasets for training and evaluating models on visually-rich documents in formats such as PDFs and websites. The collection covers a wide range of document types and formats, supporting the development of models that handle diverse real-world scenarios. It includes both single- and multi-hop questions, making it a challenging benchmark for ongoing research.

Experimental Insights

The paper’s empirical evaluations indicate that VDocRAG significantly outperforms traditional text-based RAG. It generalizes well to unseen datasets such as ChartQA and SlideVQA, supporting the value of encoding both visual and textual information for document retrieval and question answering. The analysis shows that models that encode documents as images and are trained with the proposed pre-training tasks outperform those relying solely on parsed text.

For instance, VDocRAG achieves strong nDCG@5 retrieval scores across multiple test sets, confirming its generalization and retrieval capabilities, and it delivers higher answer accuracy, underscoring the benefit of leveraging visual document elements.
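For reference, nDCG@5 rewards rankings that place relevant pages near the top of the retrieved list; a minimal implementation of the metric:

```python
# Minimal nDCG@k, included only to make the retrieval metric concrete.
import math

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """relevances: graded relevance of the ranked results, in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# e.g. a ranking where the single relevant page appears at position 2:
print(ndcg_at_k([0, 1, 0, 0, 0]))   # ≈ 0.63
```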

Implications and Future Directions

The integration of visual information into RAG frameworks has substantial theoretical and practical implications. It highlights the importance of encompassing diverse data modalities in AI models aimed at document understanding and question answering. VDocRAG’s approach opens pathways for more generalizable and robust AI systems that can process complex multimodal data, which is pivotal in fields like legal analysis, academic research, and enterprise content management.

Future developments inspired by this work might explore models that unify text, images, and other modalities, along with methods to improve the efficiency of high-resolution image processing. Similar architectures could also advance how AI systems acquire, comprehend, and disseminate knowledge across industry sectors, paving the way for more intelligent and contextually aware systems.