Unifying Multimodal Retrieval via Document Screenshot Embedding
Authors: Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin (David R. Cheriton School of Computer Science, University of Waterloo)
Abstract
The paper introduces Document Screenshot Embedding (DSE), a novel paradigm aimed at simplifying and improving document retrieval across varied modalities. Traditional retrieval systems rely heavily on tailored document parsing and content extraction, which is error-prone and often loses information. In contrast, DSE preserves the entirety of a document's information by taking a unified document screenshot as input and leveraging a large vision-language model to encode the screenshot directly into a dense representation. Comparative experiments show that DSE significantly outperforms traditional text-based retrieval methods and OCR-dependent systems across varied document retrieval tasks.
Introduction
Document retrieval systems traditionally require tailored preprocessing to handle varied document types and their multimodal contents, such as text, images, and layout. This preprocessing is complex and error-prone, and it can discard vital information. The paper proposes the DSE paradigm, which bypasses these preprocessing steps entirely and instead uses document screenshots as a unified input format. Each screenshot is encoded by a large vision-language model into a dense document representation for retrieval.
Methodology
Task Definition
The task is defined as retrieving the top-k most relevant documents from a corpus for a given query. In this work, a "document" is a self-contained unit of information, such as a web article or a PDF page.
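As a concrete toy illustration, once queries and documents live in the same embedding space, top-k retrieval reduces to a nearest-neighbor search over precomputed document embeddings. The sketch below uses brute-force dot-product scoring over hypothetical embeddings; a production system would typically use an approximate nearest-neighbor index instead:

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Return indices of the k documents with the highest
    dot-product similarity to the query (brute-force search)."""
    scores = doc_embs @ query_emb
    return np.argsort(-scores)[:k]

docs = np.eye(5)                 # 5 toy orthogonal document embeddings
query = docs[2] + 0.1 * docs[4]  # query closest to doc 2, then doc 4
ranking = top_k(query, docs, k=2)
```

Here `ranking` comes back as `[2, 4]`: document 2 scores 1.0, document 4 scores 0.1, and the rest score 0.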
Document Screenshot Embedding (DSE)
DSE employs a bi-encoder architecture where:
- Visual Encoder: Converts a document screenshot into a sequence of latent representations using a vision encoder (e.g., `clip-vit-large-patch14-336`).
- Vision LLM: Feeds these latent representations into a large language model with vision input (e.g., Phi-3-vision), which processes high-resolution images via cropping and can therefore capture more fine-grained information.
- Contrastive Learning: Trains the bi-encoder with the InfoNCE loss, maximizing the similarity between a query embedding and its relevant document embedding relative to negative documents.
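The contrastive objective can be sketched in a few lines of numpy. This is a minimal illustration of in-batch InfoNCE, not the paper's training code: it assumes unit-normalized embeddings, a hypothetical temperature of 0.05, and treats every other document in the batch as a negative for each query.

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """In-batch InfoNCE: each query's positive is the document at
    the same batch index; all other documents act as negatives."""
    # Cosine similarity via unit-normalized embeddings.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature                 # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    # Softmax cross-entropy against the diagonal (positive pairs).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 32))                 # toy query embeddings
docs = queries + 0.01 * rng.normal(size=(4, 32))   # near-identical positives
loss = info_nce_loss(queries, docs)
```

Because each positive document is nearly identical to its query, the loss here is close to zero; misaligned pairs would drive it up, which is what pushes query and relevant-document embeddings together during training.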
Results
Supervised Retrieval Effectiveness
DSE was evaluated using datasets such as Wiki-SS for Wikipedia webpages and SlideVQA for slides. Key findings include:
- In text-intensive tasks, DSE outperformed BM25 by 17 points in top-1 retrieval accuracy on questions from the Natural Questions (NQ) dataset.
- For mixed-modality tasks, DSE outperformed OCR-based text retrieval methods by over 15 points in nDCG@10 on the SlideVQA dataset.
Zero-Shot Retrieval Effectiveness
The generalization capability of DSE was also evaluated:
- On the TriviaQA dataset, DSE demonstrated superior zero-shot effectiveness compared to traditional methods, achieving 50.3% in top-1 retrieval accuracy.
- DSE notably outperformed BM25 in nDCG@10 by 8 points on the SlideVQA-open dataset.
Efficiency and Effectiveness Trade-Off
The paper explored the impact of increasing the number of crops (sub-image patches) used to encode each screenshot. More crops improved retrieval accuracy, but the resulting longer visual token sequences increased encoding time, yielding an efficiency-effectiveness trade-off.
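The cost side of this trade-off follows from a back-of-the-envelope token count. The sketch below is a simplified, hypothetical accounting: it assumes one global view plus `num_crops` crops, each contributing 576 visual tokens (a 336-pixel image at patch size 14 yields 24 x 24 = 576 patches); the actual Phi-3-vision pipeline differs in details such as separator tokens and pooling.

```python
def visual_sequence_length(num_crops, tokens_per_crop=576):
    """Simplified model: visual token count grows linearly with the
    number of crops (one assumed global view plus num_crops crops).
    Hypothetical accounting, not the exact Phi-3-vision tokenizer."""
    return (1 + num_crops) * tokens_per_crop

lengths = {crops: visual_sequence_length(crops) for crops in (1, 4, 9)}
```

Since transformer encoding cost grows at least linearly (and attention quadratically) with sequence length, going from 4 to 9 crops roughly doubles the visual sequence and more than doubles the encoding work.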
Implications and Future Developments
The DSE paradigm presents a significant step towards more robust and accurate multimodal document retrieval. By preserving the complete information of a document in screenshot format and directly encoding this with a vision-LLM, DSE addresses many limitations of traditional retrieval systems.
Future work could explore:
- Fine-tuning on more diverse document types such as arbitrary web pages and PDF files.
- Integrating DSE with text and image extraction for enhanced versatility.
- Applying contrastive pretraining methods to further improve DSE’s retrieval effectiveness.
Conclusion
DSE offers a unified approach to multimodal document retrieval that effectively encodes various document types without the need for extensive preprocessing. This method demonstrates substantial improvements over traditional retrieval paradigms, making it a promising direction for future research in the domain of multimodal information retrieval.
Limitations
The evaluation focuses on specific datasets, so DSE's effectiveness may not generalize to all document types. Additionally, its reliance on high-quality screenshots highlights the need to balance image resolution against computational cost. The authors encourage future research to address these limitations and to explore pretraining methods that further enhance DSE's performance.