Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

Published 5 Jun 2025 in cs.IR and cs.CL | (2506.04997v1)

Abstract: Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page at minimum performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and resulting Light-ColPali/ColQwen2 offer valuable insights and establish a competitive baseline for future research towards efficient VDR.

Abstract PDF Upgrade to Chat

Authors (11)

Summary

The paper demonstrates that token merging via semantic clustering outperforms token pruning in reducing patch-level embeddings with minimal loss in retrieval accuracy.
Fine-tuning post-merging is shown to recover accuracy losses even at high merging ratios, preserving over 94% of the original NDCG@5 score.
The proposed Light-ColPali/ColQwen2 model offers a practical solution for memory-efficient visual document retrieval, balancing efficiency and precision for scalable deployment.

Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

Introduction

Visual Document Retrieval (VDR) employs LVLMs to encode documents as image embeddings, eliminating the need for traditional OCR-based text parsing. This approach captures rich visual features, including layout and typography, enhancing retrieval accuracy. ColPali/ColQwen2 represents a state-of-the-art model in this domain, characterized by its use of numerous patch-level embeddings per document page to achieve high precision. However, this results in substantial memory consumption, posing challenges for scalability and deployment. The paper "Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings" (2506.04997) seeks to optimize VDR by reducing the number of patch-level embeddings without significant performance degradation.

Figure 1: Top: The relative memory consumptions for embedding storage of different VDRs. Our simple yet effective approach, Light-ColPali/ColQwen2, retains most of the performance but with significantly reduced memory cost. Bottom: The diagram of VDR equipped with ColPali/ColQwen2 retriever. It encodes each page into $N_p$ patch-level embeddings and thus incurs prohibitive memory cost.

Token Pruning: An Ineffective Strategy

The paper evaluates token pruning as a strategy to reduce memory usage in VDR systems. Various pruning approaches are assessed, including random and more systematic score-oriented and attention-oriented methods. It was found that random pruning, unexpectedly, outperformed others when significant token reduction was required. Pruning tends to falter because the relevance of each patch embedding is highly query-dependent—information unavailable at the offline indexing stage—making preemptive pruning decisions ineffective.

Figure 2: The activated patches overlap of two queries under different pruning ratios.

Token Merging: The Choices

The study shifts focus to token merging, a more viable strategy for reducing the number of embeddings. Merging consolidates multiple patch embeddings into fewer representations, retaining essential information while drastically cutting memory requirements. Three merging strategies—1D spatial pooling, 2D spatial pooling, and semantic clustering—were explored. Semantic clustering exhibited superior performance across datasets, maintaining retrieval precision even at high merging factors.

Figure 3: (a): Three merging approaches. The patches with the same colors are merged into the same embedding. (b): Three merging locations. Blue blocks represent the original modules in ColPali/ColQwen2. Orange blocks represent the added merging modules. (c): The architecture diagram of Light-Colpali/ColQwen2.

Fine-tuning the retriever post-merging proved essential for maintaining retrieval performance at extreme merging ratios, showcasing significant recovery of accuracy loss.

Figure 4: Training-free v.s. fine-tuning retriever with the same merging (clustering) approach. The performance of original ColQwen2 is highlighted in red dash.

Light-ColPali/ColQwen2: A Practical Solution

The paper introduces Light-ColPali/ColQwen2, a strategy integrating semantic clustering, late-stage merging, and fine-tuning. This approach achieves substantial memory efficiency while preserving retrieval effectiveness, offering a competitive baseline for efficient VDR systems.

The empirical evaluations demonstrate that Light-ColPali/ColQwen2 retains over 94% of original NDCG@5 scores at extremely low memory costs. This advancement underscores its suitability for deployment in memory-constrained environments, balancing efficiency and precision.

Conclusion

The paper presents a comprehensive exploration of storage-efficient VDR strategies, with clear evidence that token merging, particularly semantic clustering, provides a practical path forward. By establishing Light-ColPali/ColQwen2 as a benchmark for low-memory VDR solutions, the study contributes valuable insights to the pursuit of scalable and efficient retrieval systems. Future research may investigate adaptive merging strategies to further balance retrieval performance against memory constraints in variable document contexts.

Markdown Report Issue