Abstract: Multi-vector retrieval methods, exemplified by the ColBERT architecture, have shown substantial promise by providing strong trade-offs between retrieval latency and effectiveness. However, they come at a high cost in terms of storage, since a (potentially compressed) vector needs to be stored for every token in the input collection. To overcome this issue, we propose encoding documents into a fixed number of vectors, which are no longer necessarily tied to the input tokens. Beyond reducing storage costs, our approach has the advantage that document representations have a fixed size on disk, allowing for better OS paging management. Through experiments on the MSMARCO passage corpus and BEIR with ColBERT-v2, a representative multi-vector ranking model architecture, we find that passages can be effectively encoded into a fixed number of vectors while retaining most of the original effectiveness.
Multi-vector retrieval models, such as ColBERT, represent documents using multiple vector embeddings, typically one per token. While effective, this approach incurs substantial storage overhead, as the index size scales linearly with the total number of tokens in the collection. The paper "Efficient Constant-Space Multi-Vector Retrieval" (MacAvaney et al., 2 Apr 2025) introduces ConstBERT, a method designed to mitigate this storage cost by ensuring each document is represented by a fixed number of vectors, irrespective of its original length.
ConstBERT Methodology
The central idea of ConstBERT is to decouple the number of stored document vectors from the number of input tokens. Standard ColBERT computes a relevance score between a query $q$ (with $N$ token embeddings $q_1, \dots, q_N$) and a document $d$ (with $M$ token embeddings $d_1, \dots, d_M$) using a late interaction mechanism:
$$s(q,d) = \sum_{i=1}^{N} \max_{j=1,\dots,M} q_i^{T} d_j$$
Here, each query embedding $q_i$ interacts with all $M$ document embeddings $d_j$. ConstBERT modifies this by projecting the $M$ original document embeddings into a fixed set of $C$ new embeddings, denoted $\delta_1, \dots, \delta_C$, where $C$ is a predetermined hyperparameter (e.g., 16, 32, 64) and typically $C \ll M$.
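As a concrete illustration, the following minimal PyTorch sketch implements the MaxSim late-interaction score. The function and variable names are ours, not from the paper's codebase; the same routine applies unchanged whether the document side supplies the $M$ token vectors of standard ColBERT or the $C$ projected vectors of ConstBERT.

```python
import torch

def maxsim_score(q_embs: torch.Tensor, d_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    q_embs: (N, k) query token embeddings.
    d_embs: (M, k) document embeddings; for ConstBERT this is the fixed
            set of C projected vectors instead of M token vectors.
    Returns a scalar score.
    """
    # Similarity of every query vector against every document vector: (N, M)
    sim = q_embs @ d_embs.T
    # For each query vector keep its best-matching document vector,
    # then sum these maxima over the query vectors.
    return sim.max(dim=1).values.sum()
```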
This projection is implemented via an additional linear layer introduced during training. The input to this layer is the concatenation of the document's token embeddings $[d_1 \mid \cdots \mid d_M]$, derived from the base multi-representation model (e.g., ColBERT). A learned weight matrix $W \in \mathbb{R}^{Mk \times Ck}$ (where $k$ is the embedding dimension) transforms this input into the $C$ fixed embeddings:
$$[\delta_1 \mid \cdots \mid \delta_C] = W^{T} [d_1 \mid \cdots \mid d_M]$$
The parameters of this projection layer $W$ are learned end-to-end concurrently with the parameters of the base model. The resulting $\delta_j$ vectors maintain the original embedding dimension $k$ but are learned representations that summarize semantic facets of the document, rather than corresponding directly to individual tokens.
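A minimal sketch of such a projection layer in PyTorch is shown below, assuming documents are padded or truncated to a fixed token budget $M$ so that the concatenated input has constant dimension $Mk$. Class and parameter names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class FixedVectorProjection(nn.Module):
    """Illustrative projection from M token embeddings to C fixed embeddings."""

    def __init__(self, max_doc_tokens: int, num_fixed_vectors: int, dim: int):
        super().__init__()
        self.M, self.C, self.k = max_doc_tokens, num_fixed_vectors, dim
        # W in R^{(M*k) x (C*k)}, learned end-to-end with the base encoder.
        self.proj = nn.Linear(self.M * self.k, self.C * self.k, bias=False)

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, M, k) token embeddings from the base encoder,
        # padded/truncated to exactly M tokens per document.
        batch = token_embs.shape[0]
        flat = token_embs.reshape(batch, self.M * self.k)   # [d_1 | ... | d_M]
        fixed = self.proj(flat)                             # (batch, C*k)
        return fixed.reshape(batch, self.C, self.k)         # delta_1 ... delta_C
```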
The scoring function is adapted to use these C fixed vectors:
$$s(q,d) = \sum_{i=1}^{N} \max_{j=1,\dots,C} q_i^{T} \delta_j$$
The late interaction paradigm is preserved, but the maximization step for each query vector now occurs over a much smaller, fixed set of C document vectors.
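Combining the two sketches above, a hypothetical end-to-end scoring call might look as follows; all shapes and values are placeholders chosen for illustration.

```python
# Hypothetical usage of the sketches above (assumed shapes: k=128, M=180, C=32).
proj = FixedVectorProjection(max_doc_tokens=180, num_fixed_vectors=32, dim=128)

q_embs = torch.randn(24, 128)        # N=24 query token embeddings (stand-in)
d_tokens = torch.randn(1, 180, 128)  # one document, padded to M=180 tokens

doc_vectors = proj(d_tokens)[0]      # (C=32, 128) fixed document vectors
score = maxsim_score(q_embs, doc_vectors)
```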
Storage and Implementation Advantages
By representing every document with exactly C vectors of dimension k, the storage requirement per document becomes constant (C×k×sizeof(float)), regardless of the document's token count M. This leads to a theoretical index size reduction factor of approximately M/C compared to standard ColBERT. This reduction is orthogonal to other compression techniques like dimensionality reduction or quantization.
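For concreteness, a back-of-the-envelope calculation under assumed values (C=32, k=128, float16 storage, and an average passage length of roughly 80 tokens); the paper's reported index sizes additionally benefit from ColBERTv2-style compression, so these raw numbers are illustrative only.

```python
# Back-of-the-envelope per-document storage (illustrative values).
C, k, bytes_per_dim = 32, 128, 2           # C vectors, dimension k, float16 assumed
per_doc = C * k * bytes_per_dim            # 8192 bytes = 8 KiB per document

avg_tokens = 80                            # assumed average passage length M
reduction = avg_tokens / C                 # ~2.5x fewer vectors than token-level ColBERT

# 8 KiB per document is exactly two 4 KiB OS pages, so every document
# representation starts on a page boundary and spans a fixed number of pages.
print(per_doc, reduction)
```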
The fixed-size representation per document offers practical advantages beyond simple storage reduction. It simplifies memory management, potentially allowing document representations to align with operating system memory page sizes. This can improve I/O efficiency during retrieval, as fetching document representations becomes more predictable and potentially faster. This contrasts with variable-length representations where managing memory fragmentation and predicting I/O costs is more complex. ConstBERT provides a direct hyperparameter C to control the storage footprint, unlike methods like static pruning (e.g., ColBERTSP) where the resulting size depends on the pruning threshold and distribution of document lengths. However, ConstBERT requires training the projection layer, whereas static pruning can be applied post-hoc.
Experimental Evaluation and Results
ConstBERT was evaluated using ColBERT-v2 as the base architecture. Experiments were conducted on the MSMARCO v1 passage corpus (~8.8M passages) with MSMARCO Dev and TREC Deep Learning (DL) 2019 & 2020 queries, as well as the diverse BEIR benchmark (13 datasets). Performance was measured using MRR@10, NDCG@10, Recall@k, index size, and Mean Response Time (MRT).
MSMARCO and TREC DL Results
On the MSMARCO Dev set, ConstBERT32 (using C=32) achieved an MRR@10 of 0.419, closely matching the ColBERT baseline's 0.421. Similarly, on TREC DL 2019 and 2020, ConstBERT32 yielded NDCG@10 scores (0.729 and 0.720, respectively) nearly identical to ColBERT (0.731 and 0.721). This effectiveness was achieved with a significant index size reduction: ConstBERT32 required only 11GB for the MSMARCO passage index, compared to 22GB for the standard ColBERT baseline. Increasing C to 64 (ConstBERT64) slightly improved effectiveness (0.424 MRR@10 on Dev, 0.737/0.728 NDCG@10 on TREC DL 19/20) but increased the index size to 20GB, demonstrating the trade-off controlled by C.
BEIR Benchmark Results
Across the 13 BEIR datasets, ConstBERT32 and ConstBERT64 generally performed competitively with the ColBERT baseline in terms of NDCG@10. For instance, on the covid dataset, ConstBERT32 (NDCG@10 0.727) slightly outperformed ColBERT (0.720), while achieving substantial storage savings. Similar trends were observed across other BEIR datasets like nfcorpus, scifact, and webis-touche2020, indicating the robustness of the fixed-vector approach across diverse domains. Index size reductions were consistently observed, typically halving the storage requirement for ConstBERT32 compared to ColBERT.
Reranking Performance
ConstBERT's fixed-size representation makes it well suited for second-stage reranking. A two-stage pipeline was tested using ESPLADE (a sparse retriever) for initial candidate generation, followed by reranking with ConstBERT32. This configuration achieved an MRR@10 of 0.414 on MSMARCO Dev and NDCG@10 of 0.728 and 0.719 on TREC DL 2019 and 2020, very close to the end-to-end PLAID/ColBERT system (0.421, 0.731, and 0.721, respectively). Crucially, the two-stage approach demonstrated significantly lower latency, with an MRT below 6ms, compared to over 50ms for the end-to-end PLAID system executing ColBERT retrieval. The constant number of vectors per document also simplifies the implementation and memory footprint of the reranking stage.
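The sketch below illustrates why constant-size representations simplify a reranking stage: document vectors can live in one flat, memory-mapped array indexed purely by document id, so every lookup is a single fixed-offset slice. The file name, layout, and shapes are assumptions for illustration, not the paper's actual index format, and `maxsim_score` is the function sketched earlier.

```python
import numpy as np
import torch

# Hypothetical flat store of fixed-size document representations:
# num_docs x C x k values in float16, written contiguously to disk.
C, k = 32, 128
doc_store = np.memmap("constbert_vectors.f16", dtype=np.float16, mode="r").reshape(-1, C, k)

def rerank(q_embs: torch.Tensor, candidate_ids: list[int]) -> list[tuple[int, float]]:
    scored = []
    for doc_id in candidate_ids:
        # Constant-size representations make this a single, predictable slice:
        # the vectors for doc_id always start at offset doc_id * C * k.
        d = torch.from_numpy(np.asarray(doc_store[doc_id], dtype=np.float32))
        scored.append((doc_id, float(maxsim_score(q_embs, d))))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```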
Conclusion
The ConstBERT approach effectively addresses the storage scalability challenge of multi-vector retrieval models. By learning a projection from variable-length token embeddings to a fixed number of representative vectors (C), it achieves a constant storage footprint per document. Experimental results on MSMARCO and BEIR demonstrate that this can be achieved with minimal loss in retrieval effectiveness compared to standard ColBERT, particularly for moderate values of C like 32. This yields substantial reductions in index size (often around 50%) and offers advantages in memory management and computational efficiency, especially notable in fast reranking scenarios.