Light-ColPali/ColQwen2: Storage-Efficient Visual Document Retrieval
Last updated: June 10, 2025
Recent developments in visual document retrieval (VDR) systems such as ColPali and ColQwen2 have demonstrated that processing document images directly with vision-language models (VLMs) and patch-level multi-vector embeddings yields strong retrieval accuracy across visually rich and textually complex documents. However, these models incur substantial storage and memory overhead: indexing typically generates dozens or hundreds of patch-level vectors per page, limiting scalability for large-scale retrieval tasks. Light-ColPali/ColQwen2 introduces principled methods for reducing storage demands while minimally impacting retrieval quality, providing new baselines for efficient VDR systems (Ma et al., 5 Jun 2025).
Significance and Motivation
ColPali and ColQwen2 move beyond OCR-based approaches by employing neural backbones to encode page images directly into sets of contextualized patch embeddings. This enables fine-grained visual-textual matching and late-interaction retrieval, but the approach requires maintaining a large number of vectors per page—often up to two orders of magnitude more than classical dense-vector systems (Ma et al., 5 Jun 2025). The resulting memory consumption impedes large-scale deployment and practical use in memory-constrained environments.
Attempts to reduce representation size by pruning or compressing embeddings have not been systematically analyzed for VDR. The fundamental challenge is therefore to achieve high storage efficiency while maintaining robust, query-agnostic retrieval effectiveness.
Token Reduction Strategies
Light-ColPali/ColQwen2 systematically investigates two token reduction strategies applied during offline indexing, where future queries are unknown:
1. Token Pruning
Token pruning seeks to remove less important patch embeddings from each document page. The paper benchmarks three methods:
- Random Pruning: Uniformly drops a proportion of patch vectors at random.
- Score-Oriented Pruning: Drops vectors with the lowest informativeness scores, estimated from synthesized query relevance.
- Attention-Oriented Pruning: Drops vectors receiving the least attention in the model's internal representations.
Empirical results show that, contrary to expectation, random pruning consistently outperforms both the score- and attention-based methods, especially at high pruning ratios. For instance, at a pruning ratio of 0.95 on InfoVQA, random pruning yields 3.9% and 19.6% higher NDCG@5 than the score- and attention-based strategies, respectively (Ma et al., 5 Jun 2025). However, no pruning method supports an order-of-magnitude reduction in memory usage without severe quality loss: the relevance of patch embeddings is highly query-dependent, and query-agnostic pruning inevitably removes information that is critical for some queries.
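The random-pruning baseline above can be sketched in a few lines. This is a minimal, hedged illustration (not code from the paper's repository); `page_embs` and `random_prune` are hypothetical names, and a page is represented as an `(n_patches, dim)` array of patch embeddings.

```python
import numpy as np

def random_prune(page_embs: np.ndarray, prune_ratio: float, seed: int = 0) -> np.ndarray:
    """Uniformly drop a fraction `prune_ratio` of patch vectors at random."""
    rng = np.random.default_rng(seed)
    n = page_embs.shape[0]
    n_keep = max(1, round(n * (1.0 - prune_ratio)))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))  # keep original order
    return page_embs[keep]

# Example: a page with 768 patch vectors of dimension 128, pruned at ratio 0.95.
page = np.random.default_rng(1).normal(size=(768, 128)).astype(np.float32)
print(random_prune(page, prune_ratio=0.95).shape)  # (38, 128)
```

Because the procedure is query-agnostic, whichever 5% of vectors survive may or may not contain the evidence a given future query needs, which is exactly the limitation the paper identifies.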
2. Token Merging
Token merging aggregates groups of patch embeddings into a smaller set of vectors, aiming to preserve essential semantic and visual content. The strategies evaluated include:
- 1D Spatial Pooling: Merges tokens in sequential (flattened) order.
- 2D Spatial Pooling: Merges spatially adjacent embeddings within the page layout.
- Semantic Clustering: Groups embeddings by cosine similarity in the final representation space, then averages each group.
Semantic clustering performed at the last stage of the pipeline (after projection into the retrieval embedding space), combined with fine-tuning on merged representations, provides the most effective trade-off. Fine-tuning is critical for recovering performance at high compression rates, restoring 61–67% of the accuracy lost to merging (Ma et al., 5 Jun 2025).
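The semantic-clustering step can be sketched as follows. This is an illustrative approximation, not the paper's implementation: a plain k-means on L2-normalized vectors stands in for the clustering procedure, and all names (`merge_by_clustering`, `merging_factor`) are hypothetical.

```python
import numpy as np

def merge_by_clustering(page_embs: np.ndarray, merging_factor: int = 9,
                        n_iter: int = 10, seed: int = 0) -> np.ndarray:
    """Group patch embeddings by cosine similarity, then average each group."""
    rng = np.random.default_rng(seed)
    n, _ = page_embs.shape
    k = max(1, n // merging_factor)  # target number of merged vectors
    # Normalize so that dot products equal cosine similarities.
    x = page_embs / (np.linalg.norm(page_embs, axis=1, keepdims=True) + 1e-12)
    centers = x[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = (x @ centers.T).argmax(axis=1)  # nearest center by cosine sim
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
                centers[c] /= np.linalg.norm(centers[c]) + 1e-12
    # Replace each cluster by the mean of its original (unnormalized) embeddings.
    return np.stack([page_embs[assign == c].mean(axis=0)
                     for c in range(k) if np.any(assign == c)])

page = np.random.default_rng(2).normal(size=(768, 128)).astype(np.float32)
merged = merge_by_clustering(page, merging_factor=9)
```

In the paper's recipe, the retriever is then fine-tuned on these merged representations, which is what recovers most of the accuracy lost to compression.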
Empirical Performance and Trade-Offs
Retrieval quality is measured by NDCG@5, and memory footprint is expressed relative to the single-vector DSE baseline. Key results (Ma et al., 5 Jun 2025) are summarized below:
| Model | Memory Use (vs. DSE) | NDCG@5 (relative) | Method |
|---|---|---|---|
| ColQwen2 (original) | 64.4× | 100% | Full patch-level embeddings |
| Light-ColQwen2 (mf=9) | 7.6× | 98.2% | Semantic clustering, fine-tuned, merging factor 9 |
| Light-ColQwen2 (mf=49) | 1.8× | 94.6% | As above, with greater compression |
| DSE-Qwen2 (single vector) | 1.0× | 91% | Single page embedding |
- With a merging factor of 9, Light-ColQwen2 reduces memory usage to 11.8% of the original (7.6× DSE) while retaining 98.2% of retrieval effectiveness.
- At a merging factor of 49 (approximately 2.8% of the original memory), it preserves 94.6% of NDCG@5, still outperforming the DSE-Qwen2 single-vector baseline (Ma et al., 5 Jun 2025).
- Similar results hold for the PaliGemma-3B backbone, with comparable memory reductions and minimal accuracy loss.
Fine-tuning after merging is essential: it recovers the majority of lost performance, particularly at higher compression rates (Ma et al., 5 Jun 2025).
Implementation Considerations
- Computational Overhead: Semantic clustering and merging add 3–3.5 hours to training and about 1 minute per 500 pages during offline embedding generation, which is moderate relative to full indexing or fine-tuning (Ma et al., 5 Jun 2025).
- Dataset Sensitivity: Performance is most robust on visually sparse documents; more significant drops occur for densely populated, text-heavy pages (e.g., DocVQA, TAT-DQA) when using aggressive merging.
- Similarity function: late-interaction retrieval scores a query $q$ against a document $d$ as
  $$s(q, d) = \sum_{i=1}^{n_q} \max_{1 \le j \le n_d} \langle \mathbf{q}_i, \mathbf{p}_j \rangle,$$
  where $\mathbf{q}_i$ are the query token embeddings and $\mathbf{p}_j$ are the (possibly merged) patch embeddings (Ma et al., 5 Jun 2025).
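The late-interaction (MaxSim-style) scorer described above can be sketched directly: each query token takes its best match over the page's patch vectors, and the per-token maxima are summed. Function and variable names here are illustrative.

```python
import numpy as np

def late_interaction_score(query_embs: np.ndarray, patch_embs: np.ndarray) -> float:
    """Sum over query tokens of the max similarity to any patch vector."""
    sims = query_embs @ patch_embs.T        # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())    # best patch per token, then sum

q = np.array([[1.0, 0.0], [0.0, 1.0]])               # two query token embeddings
p = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # three patch embeddings
print(late_interaction_score(q, p))  # 2.0
```

Because the score only ever looks at each query token's best-matching patch, merging patches that are near-duplicates in embedding space changes the maxima little, which is the intuition behind why clustering-based merging degrades NDCG@5 so gently.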
Practical Impact and Open Directions
Light-ColPali/ColQwen2 sets a new competitive baseline for storage-efficient VDR, attaining order-of-magnitude reductions in memory consumption with minimal loss in retrieval effectiveness. The methods are validated on nine datasets and two major VDR backbones, demonstrating robustness across a range of visual document types (Ma et al., 5 Jun 2025).
Suggested Areas for Future Research
- Adaptive Merging: Developing mechanisms to dynamically set the merging factor based on document density or complexity.
- Complementary Compression: Exploring integration with dimensionality reduction, quantization, data cleaning, or model distillation for further efficiency gains.
- Generalization: Evaluating transferability to other multi-vector retrieval tasks beyond VDR.
- Scalability: Leveraging these memory reductions for deployment in truly large-scale or web-scale document retrieval environments.
Limitations
- Token pruning, even with query-informed or attention-based heuristics, is fundamentally limited for VDR; random pruning's modest success stems from redundancy among patch embeddings but does not suffice at high compression rates.
- Small but non-negligible accuracy losses remain at the highest compression rates and on documents with very dense information layouts, so practitioners must set the merging factor according to their own quality requirements (Ma et al., 5 Jun 2025).
Speculative Note
A possible future line, not directly tested in the cited paper, is the use of lightweight query-time token selection, or hybrid compression that combines semantic clustering with techniques such as vector quantization or sequence distillation.
References
All technical details, experimental findings, and claims are derived from "Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings" (Ma et al., 5 Jun 2025). See also the associated GitHub repository: https://github.com/illuin-tech/colpali.