Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

Published 11 Apr 2026 in cs.IR, cs.CL, and cs.LG | (2604.09982v1)

Abstract: Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, both models show a drop of 86-97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator's uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8-point gap due to ConstBERT's sparse centroid coverage, and fine-tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi-vector-reproducibility.


Authors (3)

Summary

The paper introduces a five-dimensional diagnostic framework to analyze reproducibility beyond standard benchmarks.
It demonstrates that backend configurations and fixed-pooling in ConstBERT significantly impact retrieval effectiveness under query distribution shifts.
Empirical evaluations reveal that fine-tuning fails to overcome architectural ceilings posed by uniform token weighting in multi-vector models.

Reproducibility Beyond Benchmarks in Multi-Vector Retrieval: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

Introduction

This paper critically assesses reproducibility and architectural generalization in multi-vector neural IR models, with a particular focus on ConstBERT and ColBERT-v2. The analysis goes beyond numerical reproduction of benchmark results by scrutinizing infrastructure dependencies, domain adaptation, query structure variation, and the interplay between model architecture and backend configuration. The work introduces a five-dimensional diagnostic framework that treats reproducibility not merely as matching headline results, but as a probe of architectural and systemic robustness.

Multi-Vector Retrieval: Architectural Overview

ColBERT and its derivatives typify late-interaction neural ranking: queries and documents are represented as sets of token-level embeddings and scored by the MaxSim operator, $s(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} \mathbf{q}_i^\top \mathbf{d}_j$. ColBERT-v2 extends ColBERT with residual compression and distilled supervision, producing variable-length, per-token document representations. ConstBERT instead pools every document into exactly 32 vectors, trading expressiveness for storage efficiency.
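To make the MaxSim operator concrete, here is a minimal sketch in plain Python. It follows the formula above directly; the toy 2-d embeddings are made-up values rather than model output, and the function names are illustrative, not taken from the ColBERT or ConstBERT codebases.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """s(q, d): for each query vector take its best-matching document vector, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-d embeddings (made-up values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]

score = maxsim_score(query, doc)
print(score)  # 1.0 from the first query vector + 0.5 from the second = 1.5
```

Every query vector contributes exactly one term to the sum, with no per-token weight. That property is what the later sections on long narrative queries turn on.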

Architecturally, late-interaction models gain granularity and semantic matching power over single-vector dense retrieval (DR) and have demonstrated robust performance on benchmarks such as MS-MARCO and BEIR. These benchmarks, however, are structurally homogeneous and dominated by short, well-formed queries.

Implementation Correctness and Benchmark Reproduction

Reproduction on MS-MARCO using the officially released checkpoints (with FAISS-IVF as the retrieval engine) validates implementation correctness for both ConstBERT and ColBERT-v2. The main effectiveness numbers are matched: ConstBERT's MRR@10 is within 0.05 percentage points of the reported value (38.99% vs. 39.04%), and ColBERT-v2 is within 0.55%. This establishes that the system configuration and experiment pipeline align with prior work, providing a solid basis for further architectural analysis.
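For readers unfamiliar with the metric, the following is a minimal sketch of how MRR@10 is commonly computed. It assumes the mean is taken over all queries in the run (queries with no relevant document in the top 10 contribute zero); the query and document IDs are made up, and this is not the paper's evaluation code.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        relevant = qrels.get(query_id, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Hypothetical run and relevance judgments (IDs are made up).
rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"]}
qrels = {"q1": {"d1"}, "q2": {"d7"}}

print(mrr_at_k(rankings, qrels))  # q1 contributes 1/2, q2 contributes 0 -> 0.25
```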

The authors emphasize that such reproduction alone is insufficient; it does not guarantee architectural robustness or functional generalizability, particularly because evaluation is confined to the narrow distributional regime exemplified by MS-MARCO.

Retrieval Backend Sensitivity and Infrastructure Dependencies

A critical discovery is the system-level dependency of model effectiveness on backend configurations. While FAISS-IVF reproduces ConstBERT headline results, PLAID (the engine used in ConstBERT’s original experiments) systematically underperforms: MRR@10 drops by approximately 8–9 points (over 20% relative loss). Exhaustive parameter search fails to close this gap, implicating undocumented (potentially proprietary) configurations in the original results.

Centroid coverage analysis reveals that ConstBERT’s 32-vector representation yields sparse centroid footprints in PLAID’s indexing structure; on average only 12.1/32 centroids are occupied per document, undermining candidate coverage and, by extension, top-k retrieval quality. Therefore, the system's performance is not only a function of the model architecture, but also tightly coupled to backend-engine design and configuration—a reproducibility failure mode absent from prior studies with less aggressive architectural compression.
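One way to picture the coverage statistic is the sketch below. It assumes that "occupied" means the number of distinct nearest centroids hit by a document's vectors, with nearest chosen by inner product. That reading is an assumption, as are the toy centroids and vectors, and the code does not call PLAID itself.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_centroid(vec: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid with the highest inner product (assumed assignment rule)."""
    return max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))


def distinct_centroids(doc_vecs: Sequence[Vector], centroids: Sequence[Vector]) -> int:
    """How many different centroids a single document's vectors land on."""
    return len({nearest_centroid(v, centroids) for v in doc_vecs})


def mean_coverage(docs: Sequence[Sequence[Vector]], centroids: Sequence[Vector]) -> float:
    """Average number of distinct centroids per document."""
    return sum(distinct_centroids(d, centroids) for d in docs) / len(docs)


# Toy index with three centroids and two documents (made-up values).
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
doc_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]  # lands on centroids 0, 0, 1, 0
doc_b = [[0.9, 0.1], [0.6, 0.4]]                          # lands on centroid 0 only

print(distinct_centroids(doc_a, centroids))      # 2
print(mean_coverage([doc_a, doc_b], centroids))  # (2 + 1) / 2 = 1.5
```

Under this reading, a document whose vectors cluster on few centroids is reachable through fewer candidate-generation paths, which is consistent with the coverage argument made above.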

Backend sensitivity is further magnified under distributional shift: on ToT’s long narrative queries, ConstBERT with PLAID collapses to below 1% MRR@10, compared with 4.27% under FAISS-IVF.

Domain Generalization: From MS-MARCO to BEIR

Testing on BEIR, which stress-tests domain and topic transfer but retains short query structure, provides further insight into architectural generalization. ConstBERT with PLAID drops from its reported 46.8% mean nDCG@10 to 37.4% (–20% relative), whereas ColBERT-v2 remains robust (within 1.4%). Using FAISS-IVF, ConstBERT’s degradation is even more pronounced: 29.3%, a drop of 17.5 points, or roughly 37% relative.
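Because absolute and relative drops are easy to conflate, the arithmetic behind the BEIR figures can be checked with a short sketch. It uses only the three nDCG@10 values quoted above; the helper name is illustrative.

```python
from typing import Tuple


def drops(reported: float, observed: float) -> Tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = observed - reported
    relative = 100.0 * absolute / reported
    return absolute, relative


reported_ndcg = 46.8  # ConstBERT's reported mean nDCG@10 on BEIR

plaid_abs, plaid_rel = drops(reported_ndcg, 37.4)
faiss_abs, faiss_rel = drops(reported_ndcg, 29.3)

print(f"PLAID:     {plaid_abs:+.1f} points, {plaid_rel:+.1f}% relative")  # -9.4 points, -20.1% relative
print(f"FAISS-IVF: {faiss_abs:+.1f} points, {faiss_rel:+.1f}% relative")  # -17.5 points, -37.4% relative
```

On these numbers, the 17.5 figure matches the absolute gap between 46.8 and 29.3, while the corresponding relative loss is about 37%.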

This asymmetry suggests that compressed, fixed-length multi-vector representations are more brittle under topic shift, particularly when backend configurations optimized for one architecture (and query style) are deployed for another. Learned pooling reduces index size but discards nuances needed for new domains, in contrast to variable-length per-token approaches.

Structural Generalization: Query Distribution Shift and Failure Modes

The most striking findings come from evaluation on the TREC ToT dataset, which comprises long, ambiguous narrative queries (median 121 words). These pose a severe distributional shift compared to MS-MARCO's 6-word median queries. Both ConstBERT and ColBERT-v2 collapse: 86–97% drops in MRR@10, frequently falling below BM25’s recall.

Ablation on query length reveals that performance saturates at around 20 words: additional context (reaching up to 121 words) confers no gain, and the MaxSim scoring function plateaus due to uniform token weighting. Architectural analysis via brute-force MaxSim computation (bypassing all approximation) shows that model quality cannot exceed about 5% MRR@10. This indicates that the collapse arises from a mathematical property of the MaxSim operator itself: it sums over all query tokens indiscriminately, giving filler and hedge tokens (which dominate narrative queries) the same weight as informative terms, thereby diluting the similarity signal and overwhelming the model's discriminative capacity.
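A query-length ablation of this kind can be set up with a small helper like the one below. The sketch assumes truncation keeps the first n whitespace-separated words; the summary does not say how the authors shortened queries, and the length grid and example query are hypothetical.

```python
from typing import Dict, Iterable


def truncate_query(query: str, max_words: int) -> str:
    """Keep only the first max_words whitespace-separated words of the query."""
    return " ".join(query.split()[:max_words])


def length_ablation(query: str, lengths: Iterable[int]) -> Dict[int, str]:
    """Build one truncated variant of the query per target length."""
    return {n: truncate_query(query, n) for n in lengths}


# Hypothetical narrative query and length grid (not taken from TREC ToT).
narrative = (
    "I think it was maybe a movie from the nineties about a lighthouse keeper "
    "who finds a message in a bottle but I am not really sure about the details"
)
variants = length_ablation(narrative, lengths=(5, 10, 20))

for n, text in variants.items():
    print(n, "->", text)
```

Each variant would then be run through the same retrieval pipeline and scored, so that effectiveness can be plotted against query length.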

Adaptation Potential and Architectural Ceilings

Fine-tuning both ColBERT-v2 and ConstBERT on ToT data (with up to 3× more labeled data than the baseline) not only fails to close the gap, but further degrades performance by up to 29%. This result demonstrates that adaptation via data or gradient descent cannot overcome the architectural limitation imposed by uniform MaxSim: the inability to reweight tokens dynamically, or otherwise discount filler/irrelevant segments, is baked into the scoring paradigm.

Implications and Theoretical Impact

This research redefines reproducibility for neural IR as a multi-faceted, system-level property. It identifies backend configuration as a first-class artifact and reveals that architectural properties—such as representation compactness and scoring operator selection—induce hard ceilings on generalization that cannot be resolved by additional adaptation or larger datasets. Standard benchmarks are shown to insufficiently challenge model robustness, as they systematically exclude structurally diverse query distributions present in many realistic retrieval applications.

From a theoretical perspective, this work exposes that late-interaction multi-vector retrieval, when combined with uniform token weighting, lacks the flexibility required for out-of-distribution generalization—especially for verbose, conversational, or ambiguous queries. This diagnostic finding has immediate implications: further gains in practical retrieval will require architectures with adaptive or learned token weighting mechanisms, possibly integrating query reduction or term importance predictors directly into the late-interaction pipeline.
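As an illustration of what non-uniform token weighting could look like, the sketch below adds a per-query-token weight to the MaxSim sum. This is a hypothetical variant, not a method proposed or evaluated in the paper. The weights here are set by hand, whereas in practice they would have to come from some term-importance predictor.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_maxsim(
    query_vecs: Sequence[Vector],
    doc_vecs: Sequence[Vector],
    weights: Sequence[float],
) -> float:
    """Sum of per-token best matches, each scaled by a (hypothetical) importance weight."""
    if len(weights) != len(query_vecs):
        raise ValueError("need exactly one weight per query vector")
    return sum(
        w * max(dot(q, d) for d in doc_vecs)
        for q, w in zip(query_vecs, weights)
    )


# Toy embeddings (made-up values): treat the second query vector as a filler token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]

uniform = weighted_maxsim(query, doc, weights=[1.0, 1.0])     # identical to plain MaxSim: 1.5
downweighted = weighted_maxsim(query, doc, weights=[1.0, 0.0])  # filler token ignored: 1.0
print(uniform, downweighted)
```

With all weights equal to 1.0 the function reduces to standard MaxSim, so uniform weighting is the special case that the analysis above identifies as the bottleneck.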

Conclusion

This paper decisively shows that multi-vector retrieval architectures such as ConstBERT and ColBERT-v2, while numerically reproducible on traditional benchmarks, fail architecturally under systematic stress testing along backend, domain, and query structure axes. Backend configurations (and their documentation) are integral to reproducibility. Efficiency-oriented design choices, such as fixed-pooling, introduce sensitivity that destabilizes generalization, while core scoring mechanisms like uniform MaxSim are fundamentally incompatible with verbose, narrative information needs. Fine-tuning is not a panacea; where architecture induces an adaptation ceiling, only architectural innovation (e.g., non-uniform token weighting, structure-aware interaction models) offers a viable path forward.

Developing and validating robust, generalizable neural retrieval systems will require new benchmarks that probe structural variance and richer diagnostic practice, moving beyond the narrow paradigm of matching metrics to a holistic understanding of where, and why, architectures succeed or fail.
