Unifying Multimodal Retrieval via Document Screenshot Embedding
Authors: Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin (David R. Cheriton School of Computer Science, University of Waterloo)
Abstract
The paper introduces Document Screenshot Embedding (DSE), a novel paradigm aimed at simplifying and improving document retrieval across varied modalities. Traditional retrieval systems rely heavily on tailored document parsing and content extraction, which is error-prone and often loses information. In contrast, DSE preserves the entirety of a document's information by taking a unified document screenshot as input and leveraging a large vision-language model to encode the screenshot directly into a dense representation. Comparative experiments show that DSE significantly outperforms traditional text-based retrieval methods and OCR-dependent systems across varied document retrieval tasks.
Introduction
Document retrieval systems traditionally require tailored preprocessing to handle varied document types and their multimodal contents, such as text, images, and layout. This preprocessing is complex and error-prone, and it can discard vital information. The paper proposes the DSE paradigm, which bypasses these preprocessing steps entirely and instead uses document screenshots as a unified input format. Each screenshot is encoded by a large vision-language model into a dense document representation for retrieval.
Methodology
Task Definition
The task is defined as retrieving the top-k most relevant documents from a corpus for a given query. In this work, a "document" is a self-contained unit of information, such as a web article or a PDF page.
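As a concrete toy illustration, once queries and documents live in the same embedding space, top-k retrieval reduces to a nearest-neighbor search over precomputed document embeddings. The sketch below uses brute-force dot-product scoring over hypothetical embeddings; a production system would typically use an approximate nearest-neighbor index instead:

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Return indices of the k documents with the highest
    dot-product similarity to the query (brute-force search)."""
    scores = doc_embs @ query_emb
    return np.argsort(-scores)[:k]

docs = np.eye(5)                 # 5 toy orthogonal document embeddings
query = docs[2] + 0.1 * docs[4]  # query closest to doc 2, then doc 4
ranking = top_k(query, docs, k=2)
```

Here `ranking` comes back as `[2, 4]`: document 2 scores 1.0, document 4 scores 0.1, and the rest score 0.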
Document Screenshot Embedding (DSE)
DSE employs a bi-encoder architecture where:
- Visual Encoder: Converts a document screenshot into a sequence of latent representations using a vision encoder (e.g., `clip-vit-large-patch14-336`).
- Vision LLM: Feeds these latent representations into a large language model with vision input (e.g., Phi-3-vision), which processes high-resolution images via cropping and can therefore capture more fine-grained information.
- Contrastive Learning: Trains the bi-encoder with the InfoNCE loss, maximizing the similarity between a query embedding and its relevant document embedding relative to negative documents.
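The contrastive objective can be sketched in a few lines of numpy. This is a minimal illustration of in-batch InfoNCE, not the paper's training code: it assumes unit-normalized embeddings, a hypothetical temperature of 0.05, and treats every other document in the batch as a negative for each query.

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """In-batch InfoNCE: each query's positive is the document at
    the same batch index; all other documents act as negatives."""
    # Cosine similarity via unit-normalized embeddings.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature                 # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    # Softmax cross-entropy against the diagonal (positive pairs).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 32))                 # toy query embeddings
docs = queries + 0.01 * rng.normal(size=(4, 32))   # near-identical positives
loss = info_nce_loss(queries, docs)
```

Because each positive document is nearly identical to its query, the loss here is close to zero; misaligned pairs would drive it up, which is what pushes query and relevant-document embeddings together during training.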
Results
Supervised Retrieval Effectiveness
DSE was evaluated using datasets such as Wiki-SS for Wikipedia webpages and SlideVQA for slides. Key findings include:
- In text-intensive tasks, DSE outperformed BM25 by 17 points in top-1 retrieval accuracy on questions from the Natural Questions (NQ) dataset.
- For mixed-modality tasks, DSE outperformed OCR-based text retrieval methods by over 15 points in nDCG@10 on the SlideVQA dataset.
Zero-Shot Retrieval Effectiveness
The generalization capability of DSE was also evaluated:
- On the TriviaQA dataset, DSE demonstrated superior zero-shot effectiveness compared to traditional methods, achieving 50.3% in top-1 retrieval accuracy.
- DSE notably outperformed BM25 in nDCG@10 by 8 points on the SlideVQA-open dataset.
Efficiency and Effectiveness Trade-Off
The paper explored the impact of increasing the number of crops (sub-image patches) used to encode each screenshot. More crops improved retrieval accuracy, but the resulting longer visual token sequences increased encoding time, yielding an efficiency-effectiveness trade-off.
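The cost side of this trade-off follows from a back-of-the-envelope token count. The sketch below is a simplified, hypothetical accounting: it assumes one global view plus `num_crops` crops, each contributing 576 visual tokens (a 336-pixel image at patch size 14 yields 24 x 24 = 576 patches); the actual Phi-3-vision pipeline differs in details such as separator tokens and pooling.

```python
def visual_sequence_length(num_crops, tokens_per_crop=576):
    """Simplified model: visual token count grows linearly with the
    number of crops (one assumed global view plus num_crops crops).
    Hypothetical accounting, not the exact Phi-3-vision tokenizer."""
    return (1 + num_crops) * tokens_per_crop

lengths = {crops: visual_sequence_length(crops) for crops in (1, 4, 9)}
```

Since transformer encoding cost grows at least linearly (and attention quadratically) with sequence length, going from 4 to 9 crops roughly doubles the visual sequence and more than doubles the encoding work.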
Implications and Future Developments
The DSE paradigm presents a significant step towards more robust and accurate multimodal document retrieval. By preserving the complete information of a document in screenshot format and directly encoding this with a vision-LLM, DSE addresses many limitations of traditional retrieval systems.
Future work could explore:
- Fine-tuning on more diverse document types such as arbitrary web pages and PDF files.
- Integrating DSE with text and image extraction for enhanced versatility.
- Applying contrastive pretraining methods to further improve DSE’s retrieval effectiveness.
Conclusion
DSE offers a unified approach to multimodal document retrieval that effectively encodes various document types without the need for extensive preprocessing. This method demonstrates substantial improvements over traditional retrieval paradigms, making it a promising direction for future research in the domain of multimodal information retrieval.
Limitations
The evaluation focuses on specific datasets, so DSE's effectiveness may not generalize to all document types. Additionally, its reliance on high-quality screenshots highlights the need to balance image resolution against computational cost. The authors encourage future research to address these limitations and to explore pretraining methods that further enhance DSE's performance.