Light-ColPali/ColQwen2 VDR Models
- Light-ColPali/ColQwen2 is a storage-efficient visual document retrieval model that reduces the large number of patch embeddings stored per page through principled token merging strategies.
- It overcomes scalability issues by aggregating semantically similar patches into compact super-tokens, balancing accuracy with reduced memory usage.
- Empirical results demonstrate that fine-tuning on merged representations retains 98.2% of baseline performance at a 9× token reduction, while significantly lowering storage requirements.
Light-ColPali/ColQwen2 refers to a class of storage-efficient visual document retrieval (VDR) models based on the ColPali/ColQwen2 architecture, which are designed for high-performance retrieval from visually rich documents while significantly reducing memory footprint. These models address the primary scalability limitation of earlier VLM-based retrievers—the excessive storage required for fine-grained patch-level embeddings—by introducing principled token (i.e., patch embedding) reduction strategies that preserve retrieval effectiveness at a fraction of the memory cost.
1. Background and Motivation
ColPali/ColQwen2 are VLM-based retrievers that use multi-vector (“late interaction”) representations analogous to ColBERT for text, but applied to images, especially document pages. Each page is encoded into a set of patch-level embeddings, typically producing hundreds of 128-dimensional vectors per page. This granularity enables state-of-the-art retrieval performance on visually complex documents but also multiplies the storage and compute cost relative to single-vector (dense) approaches. With real-world document corpora, the memory requirement quickly becomes the bottleneck for practical deployment.
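To make the cost concrete, a rough back-of-the-envelope estimate (assuming on the order of 1,000 patches per page and 16-bit floats; exact figures vary with the model and image resolution):

$$1{,}000 \text{ patches} \times 128 \text{ dims} \times 2 \text{ bytes} \approx 256\,\text{KB per page}, \qquad 10^{6} \text{ pages} \approx 256\,\text{GB of embeddings alone.}$$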
The goal of Light-ColPali/ColQwen2 is to develop and empirically validate strategies for reducing the number of stored patch embeddings per page as much as possible, without access to queries at indexing time, while minimizing performance degradation. The result is a scalable retrieval paradigm balancing accuracy, latency, and storage cost.
2. Token Reduction Methodologies
Two main token reduction strategies are investigated: pruning and merging. Each has distinct algorithmic, mathematical, and practical implications.
2.1 Token Pruning
Token pruning reduces storage by selecting a subset of the original patch embeddings to retain per page. Methods include:
- Random pruning: Discard a random subset of patch embeddings.
- Score-oriented pruning: Rank patches by a proxy "response potential" estimated via synthesized queries (see the pruning sketch after this list): $s_j = \max_{q \in \mathcal{Q}_{\text{syn}}} \max_{i} \mathbf{q}_i^{\top} \mathbf{p}_j$, where $\mathbf{p}_j$ is patch $j$'s embedding, $\mathbf{q}_i$ is a token embedding of synthesized query $q$, and the maximum is taken over the set of synthesized queries $\mathcal{Q}_{\text{syn}}$ in lieu of the true, unknown queries at index time.
- Attention-oriented pruning: Retain the patches that attract the most cross-token attention in the final transformer layer during encoding.
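The following is a minimal sketch of these offline pruning strategies, assuming one page's patch embeddings and a pool of synthesized query token embeddings are available as NumPy arrays (the array layout, keep ratio, and function names are illustrative, not the original implementation):

```python
import numpy as np

def score_oriented_prune(patch_emb: np.ndarray,
                         synth_query_emb: np.ndarray,
                         keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the patches with the highest "response potential": the maximum
    dot product against any synthesized query token.

    patch_emb:       (N, d) patch embeddings of one page
    synth_query_emb: (T, d) token embeddings pooled from synthesized queries
    """
    sims = patch_emb @ synth_query_emb.T      # (N, T) patch-to-query-token similarity
    potential = sims.max(axis=1)              # s_j = max over synthetic query tokens
    k = max(1, int(round(keep_ratio * len(patch_emb))))
    keep_idx = np.argsort(-potential)[:k]     # highest-potential patches first
    return patch_emb[keep_idx]

def random_prune(patch_emb: np.ndarray, keep_ratio: float = 0.5,
                 seed: int = 0) -> np.ndarray:
    """Baseline: keep a uniformly random subset of patches."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(keep_ratio * len(patch_emb))))
    keep_idx = rng.choice(len(patch_emb), size=k, replace=False)
    return patch_emb[keep_idx]
```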
Empirical findings demonstrate that all pruning schemes, including sophisticated score- and attention-guided variants, are fundamentally limited:
- Random pruning unexpectedly outperforms all heuristic methods in preserving retrieval quality at moderate compression rates.
- No pruning approach supports aggressive reduction (order of magnitude, i.e., ≥90% of patches pruned) without severe performance loss.
- The key limitation arises from high query-specific variability: pruning offline can easily discard exactly the patches that a particular query needs.
2.2 Token Merging
Token merging compresses patch embeddings by aggregating groups of patches into single “super-tokens.” Unlike pruning, this preserves information by combining instead of discarding. The merging process is formulated as:
- Original embeddings: $\{\mathbf{p}_1, \dots, \mathbf{p}_N\}$
- After reduction: $\{\tilde{\mathbf{p}}_1, \dots, \tilde{\mathbf{p}}_M\}$ with $M \ll N$
- For cluster $C_k$, merged embedding: $\tilde{\mathbf{p}}_k = \frac{1}{|C_k|} \sum_{j \in C_k} \mathbf{p}_j$
Merging strategies considered:
- 1D/2D spatial pooling: Average pooling over sequential patches or spatial image neighborhoods.
- Semantic clustering: Hierarchical clustering by embedding similarity, grouping patches that are nearest neighbors in feature space.
After merging, retrieval computes the late-interaction score $\text{score}(Q, D) = \sum_{i} \max_{k} \mathbf{q}_i^{\top} \tilde{\mathbf{p}}_k$, where the $\mathbf{q}_i$ are query token embeddings and the $\tilde{\mathbf{p}}_k$ are the merged page embeddings (see the sketch below).
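A minimal sketch of the merge-then-score flow, using SciPy's agglomerative clustering as a stand-in for semantic clustering and mean pooling for the merge (the function names, cluster-count heuristic, and re-normalization step are illustrative assumptions, not the authors' exact procedure):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_patches(patch_emb: np.ndarray, factor: int = 9) -> np.ndarray:
    """Cluster patch embeddings by cosine similarity and mean-pool each
    cluster into one "super-token", reducing N vectors to about N / factor."""
    num_clusters = max(1, len(patch_emb) // factor)
    tree = linkage(patch_emb, method="average", metric="cosine")  # semantic, not spatial
    labels = fcluster(tree, t=num_clusters, criterion="maxclust")
    merged = np.stack([patch_emb[labels == c].mean(axis=0)
                       for c in np.unique(labels)])
    # re-normalize so MaxSim scoring still compares unit vectors
    return merged / np.linalg.norm(merged, axis=1, keepdims=True)

def late_interaction_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim: sum over query tokens of each token's best match."""
    return float((query_emb @ page_emb.T).max(axis=1).sum())
```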
Empirical analysis confirms that merging, particularly by semantic clustering, far surpasses pruning when aiming for aggressive embedding reductions.
3. Multi-Dimensional Optimization of Merging
Light-ColPali/ColQwen2 arrives at its optimal design by systematically exploring merging across three key axes:
- Merging algorithm: Semantic (embedding-based) clustering is superior to naive spatial pooling, preserving information from non-contiguous but semantically related document regions.
- Fine-tuning: Retraining the retriever end-to-end on merged (compressed) embeddings enables the model to adapt, recovering accuracy lost to compression. Merging-only (without fine-tuning) leaves substantial performance on the table.
- Merging location: Merging in the model pipeline after the last projection layer (i.e., on low-dimensional, task-adapted vectors) maximizes information retention and computational efficiency, outperforming merging earlier (e.g., at raw patch or post-encoder representations).
This yields a methodology where semantic clustering is performed post-projection, merged embeddings are used for retrieval, and the entire model is fine-tuned to adjust for the downstream effect of token compression.
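As a rough illustration of the chosen merging location, the sketch below compresses only after the final low-dimensional projection (it reuses merge_patches from the Section 2.2 sketch; the layer widths and the random projection matrix are hypothetical stand-ins for the real backbone and projection head):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins: 1,024 backbone patch states of width 1,536,
# projected to the 128-dimensional retrieval space
backbone_states = rng.standard_normal((1024, 1536)).astype(np.float32)
projection = (rng.standard_normal((1536, 128)) / np.sqrt(1536)).astype(np.float32)

# merging AFTER the last projection layer: project first, then compress
patch_emb = backbone_states @ projection                      # (1024, 128)
patch_emb /= np.linalg.norm(patch_emb, axis=1, keepdims=True)
super_tokens = merge_patches(patch_emb, factor=9)             # (~113, 128)

print(f"stored per page: {super_tokens.nbytes / 1024:.0f} KB "
      f"vs. {patch_emb.nbytes / 1024:.0f} KB unmerged")
```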
4. Empirical Results: Effectiveness and Efficiency
Experimental results across major VDR benchmarks (ViDoRE, VisRAG, MMLongBench-Doc) indicate:
- Memory efficiency: At a merging factor of 9 (i.e., roughly 1/9 of the original tokens retained), Light-ColPali/ColQwen2 maintains 98.2% of the original NDCG@5 score.
- Further compression: With more aggressive merging (down to roughly 1/49 of the original token count, "factor 49"), performance is still 94.6% of baseline and superior to DSE (single-vector) models at equivalent or lower memory.
- Overhead: The merging process introduces minimal computational cost relative to the substantial storage savings; the major resource savings are at index build and storage, enabling scaling to large document corpora.
A summary from the main results table:
| Model/Variant | Relative Memory | NDCG@5 (relative) |
|---|---|---|
| ColQwen2 (original) | 64.4× DSE | 81.4 (100%) |
| Light-ColQwen2 (merging factor 9) | 7.6× | 79.9 (98.2%) |
| Light-ColQwen2 (merging factor 49) | 1.8× | 77.0 (94.6%) |
| DSE-Qwen2 | 1.0× | 74.1 (91.0%) |
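The relative figures in the last column are simply each variant's NDCG@5 divided by the uncompressed ColQwen2 score:

$$\frac{79.9}{81.4} \approx 98.2\%, \qquad \frac{77.0}{81.4} \approx 94.6\%, \qquad \frac{74.1}{81.4} \approx 91.0\%.$$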
5. Significance for Visual Document Retrieval
The findings demonstrate that:
- Query-agnostic token pruning is fundamentally unsuitable for VDR, because patch relevance is highly query-dependent and cannot be predicted before the query is known.
- Semantic clustering-based merging with fine-tuning achieves an effective compression/accuracy trade-off and is practical for production.
- Scaling up VDR: The Light-ColPali/ColQwen2 approach removes the main storage bottleneck, making VLM-based document retrieval feasible for million-document corpora and resource-constrained environments.
- Future work is expected to refine adaptive per-document merging, integrate with further compression techniques (e.g., quantization), and ultimately explore query-conditioned reduction methods.
6. Mathematical and Implementation Details
The Light-ColPali/ColQwen2 system is characterized by:
- Patch-to-query late interaction: each query token embedding $\mathbf{q}_i$ is matched against the page's (merged) embeddings, and the retrieval score is aggregated as $\text{score}(Q, D) = \sum_{i=1}^{|Q|} \max_{k} \mathbf{q}_i^{\top} \tilde{\mathbf{p}}_k$.
- Semantic-clustering merge: $\tilde{\mathbf{p}}_k = \frac{1}{|C_k|} \sum_{j \in C_k} \mathbf{p}_j$ for each cluster $C_k$ of semantically similar patches, applied after the last projection layer.
- End-to-end retriever fine-tuning: All model parameters are updated after merging is introduced, restoring lost representational and retrieval ability.
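The fine-tuning objective is not specified above; a common choice for late-interaction retrievers, and a reasonable assumption here, is an in-batch contrastive loss over MaxSim scores computed on the already-merged page embeddings, so that gradients teach the encoder to compensate for compression. A hedged PyTorch sketch (not the authors' training code):

```python
import torch
import torch.nn.functional as F

def late_interaction_scores(query_emb: torch.Tensor,
                            page_emb: torch.Tensor) -> torch.Tensor:
    """Batched MaxSim. query_emb: (B, T, d) query token embeddings;
    page_emb: (B, M, d) merged super-tokens. Returns a (B, B) matrix of
    scores between every query and every page in the batch."""
    sims = torch.einsum("qtd,pmd->qptm", query_emb, page_emb)  # token-level similarities
    return sims.max(dim=-1).values.sum(dim=-1)  # max over page tokens, sum over query tokens

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              page_emb: torch.Tensor) -> torch.Tensor:
    """The i-th page is the positive for the i-th query; all other pages in
    the batch act as negatives. Backpropagating through the merged embeddings
    lets the retriever adapt to token compression end to end."""
    scores = late_interaction_scores(query_emb, page_emb)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```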
7. Broader Implications and Baseline Status
Light-ColPali/ColQwen2 sets a new practical baseline for storage-efficient VDR methods. The empirical demonstration that merging (not pruning) is the viable pathway for scalable visual retrieval architectures is a key contribution. The approach is deployable across domains requiring fast, scalable, and storage-conscious page-level search—enabling new applications in enterprise, scientific, and industrial search contexts.
A plausible implication is that similar merging and fine-tuning strategies can be generalized as a standard compression paradigm in multi-vector representations across vision-language retrieval domains, beyond just ColPali/ColQwen2 style models.