Comparative Retriever with CMC Framework

Updated 22 December 2025
  • Comparative Retriever (CR) is a retrieval reranking method that leverages the CMC framework to jointly compare multiple candidate embeddings using shallow self-attention.
  • It integrates between bi-encoder and cross-encoder stages to improve recall by 3.5–4.8 percentage points on benchmarks like ZeSHEL while incurring minimal additional latency.
  • The CMC framework employs a scalable Transformer architecture that enhances throughput and precision in tasks such as entity linking and dialogue ranking.

A Comparative Retriever (CR) implemented via the Comparing Multiple Candidates (CMC) framework is an advanced reranking architecture designed to simultaneously improve retrieval recall and top-1 ranking accuracy within classic retrieve-and-rerank pipelines. CMC enables joint, contextualized comparison of a query with multiple candidate embeddings through shallow self-attention, bridging the speed–accuracy gap between bi-encoder (BE) and cross-encoder (CE) methods by directly modeling their intermediate candidate sets with a scalable Transformer. Designed for deployability as either an intermediate reranker or a final-stage scoring mechanism, CMC supports high-throughput settings and outperforms traditional pipelines on a range of retrieval tasks in both accuracy and end-to-end latency (Song et al., 2024).

1. Pipeline Integration and Function

Typical retrieval pipelines consist of a fast bi-encoder (BE) for large-scale candidate retrieval, followed by a slower, more expressive cross-encoder (CE) reranker for a limited candidate pool. The CMC layer is introduced between BE and CE:

  • Stage 1: Bi-Encoder Retrieval. The query $q$ and all candidates $c_i$ are embedded by the BE, and Maximum Inner Product Search (MIPS) returns the top-$K$ candidates $C_q = \{c_{q,1}, \dotsc, c_{q,K}\}$.
  • Stage 2: CMC Reranking. CMC ingests the single query $q$ and all $K$ candidate embeddings, jointly encoding them via shallow self-attention to compute contextualized similarity scores for all $K$ candidates in parallel.
  • Stage 3: Optional Cross-Encoder Final Rerank. The highest-scoring $K'$ ($K' < K$) candidates from CMC may be forwarded to a CE for a final, stricter rerank.

This yields an enhanced pipeline, BE → CMC → CE, in which CMC supplies a "virtually enhanced" retrieval stage with improved recall (e.g., $\Delta$R@16 $\approx$ +4.8%-p and $\Delta$R@64 $\approx$ +3.5%-p on ZeSHEL) at a negligible latency increase (<7%). As a standalone reranker, CMC is both faster (11$\times$) and more effective on specific downstream tasks (+0.7%-p on Wikipedia entity linking, +3.3 MRR on DSTC7 dialogue ranking) than CE (Song et al., 2024). A minimal sketch of the staging follows.
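
The control flow can be made concrete with a short Python sketch; bi_encoder, cmc_scores, and cross_encoder are hypothetical stand-ins for the components described above, not an official API.

def retrieve(query, K=512, K_prime=16, use_ce=True):
    # Stage 1: MIPS over precomputed candidate embeddings (hypothetical BE API).
    candidates = bi_encoder.top_k(query, k=K)
    # Stage 2: one joint CMC forward pass scores all K candidates at once.
    scores = cmc_scores(query, candidates)                # K floats
    order = sorted(range(K), key=lambda j: scores[j], reverse=True)
    ranked = [candidates[j] for j in order]
    if not use_ce:
        return ranked                                     # CMC as the final reranker
    # Stage 3: strict CE rerank restricted to CMC's top-K' survivors.
    return cross_encoder.rerank(query, ranked[:K_prime])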

2. Mathematical Formulation

CMC is composed of several key formal elements:

2.1 Sentence-level Encodings

A query $q$ and each candidate $c_{q,j}$ are encoded as token sequences:

  • Query: $\mathbf{x}_q = [\texttt{[CLS]}, x_q^{1}, \dotsc, x_q^{m}]$
  • Candidate: $\mathbf{x}_{c_{q,j}} = [\texttt{[CLS]}, x_{c_{q,j}}^{1}, \dotsc, x_{c_{q,j}}^{n_j}]$

Two separate encoders, $\mathrm{Enc}_{qry}$ and $\mathrm{Enc}_{can}$, extract single-vector summaries:

$$\mathbf{h}_q^{\rm sent} = \mathrm{Enc}_{qry}(\mathbf{x}_q)_{[\texttt{CLS}]}, \qquad \mathbf{h}_{c_{q,j}}^{\rm sent} = \mathrm{Enc}_{can}(\mathbf{x}_{c_{q,j}})_{[\texttt{CLS}]}$$
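
As an illustration of this step, the sketch below extracts [CLS] summaries with two independent encoders via Hugging Face transformers; the bert-base-uncased checkpoint and the maximum lengths are assumptions for demonstration, not the released weights.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_qry = AutoModel.from_pretrained("bert-base-uncased")   # stands in for Enc_qry
enc_can = AutoModel.from_pretrained("bert-base-uncased")   # stands in for Enc_can

@torch.no_grad()
def cls_embed(encoder, texts, max_len):
    batch = tok(texts, padding=True, truncation=True,
                max_length=max_len, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, seq_len, 768)
    return hidden[:, 0]                                    # [CLS] vector per sequence

h_q = cls_embed(enc_qry, ["where was Ada Lovelace born?"], max_len=32)    # (1, 768)
H_c = cls_embed(enc_can, ["Ada Lovelace ...", "London ..."], max_len=128) # (2, 768)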

2.2 Multi-Candidate Self-Attention

Stack the query and $K$ candidate embeddings:

$$\mathbf{H}^0 = [\,\mathbf{h}_q^{\rm sent};\ \mathbf{h}_{c_{q,1}}^{\rm sent};\ \dotsc;\ \mathbf{h}_{c_{q,K}}^{\rm sent}\,] \in \mathbb{R}^{(K+1)\times d}$$

Process $\mathbf{H}^0$ through $L$ Transformer encoder layers with no positional encoding ($L=2$, $H=12$ heads, $d=768$):

$$\mathbf{H}^{\rm CMC} = \mathrm{TransformerEncoder}_{L}(\mathbf{H}^0)$$

with $\mathbf{H}^{\rm CMC} = [\,\mathbf{h}_q^{\rm CMC};\ \mathbf{h}^{\rm CMC}_{c_{q,1}};\ \dotsc;\ \mathbf{h}^{\rm CMC}_{c_{q,K}}\,]$.
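
A minimal PyTorch sketch of this comparison module under the stated hyperparameters (the feed-forward width is left at the library default, an assumption); omitting positional encodings makes the scores invariant to candidate order.

import torch
import torch.nn as nn

d, n_layers, n_heads = 768, 2, 12
layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
cmc_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

K = 64
h_q = torch.randn(1, 1, d)            # query summary from Enc_qry
H_c = torch.randn(1, K, d)            # candidate summaries from Enc_can
H0 = torch.cat([h_q, H_c], dim=1)     # (1, K+1, d); no positions are added
H = cmc_encoder(H0)                   # contextualized query and candidates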

2.3 Scoring & Selection

Score each contextualized candidate against the contextualized query:

$$s_j = (\mathbf{h}_q^{\rm CMC})^{\top}\, \mathbf{h}_{c_{q,j}}^{\rm CMC}, \qquad 1 \leq j \leq K$$

Select the top-1 candidate $\hat{c}_q = \arg\max_j s_j$; for further reranking, pass the top-$K'$ candidates by $s_j$ downstream, as in the snippet below.
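
Continuing the PyTorch sketch above, scoring and selection reduce to dot products against the contextualized query plus a top-k (K' = 16 here is just an example value).

scores = (H[:, :1] * H[:, 1:]).sum(dim=-1)      # (1, K) dot products s_j
top1 = scores.argmax(dim=-1)                    # index of the top-1 candidate
K_prime = 16
to_ce = scores.topk(K_prime, dim=-1).indices    # candidates forwarded to the CE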

2.4 Training Loss

Employ a multi-class cross-entropy over the $K$ candidates:

$$p_j = \frac{\exp(s_j)}{\sum_{i=1}^{K} \exp(s_i)}, \qquad \mathcal{L}_{\rm CE} = -\sum_{j=1}^{K} y_j \log p_j$$

where $y_j = \mathbf{1}[j = g]$ and $g$ indexes the gold candidate. Optionally, include a KL regularizer that biases CMC toward the BE's retrieval distribution $r_j$:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\rm CE} + \lambda_2 \sum_{j=1}^{K} p_j \log\frac{p_j}{r_j}, \qquad \lambda_1 + \lambda_2 = 1$$
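
The combined objective can be sketched in PyTorch as follows; the tensor shapes and the 0.8/0.2 weight split are placeholder assumptions. Note that F.kl_div, given the reference log-probabilities as input and $p$ as target, computes exactly $\sum_j p_j \log(p_j / r_j)$.

import torch
import torch.nn.functional as F

scores = torch.randn(8, 64, requires_grad=True)   # CMC scores s_j, shape (batch, K)
be_scores = torch.randn(8, 64)                    # BE scores for the same candidates
gold = torch.randint(0, 64, (8,))                 # gold candidate index g per query

lam1, lam2 = 0.8, 0.2                             # lambda_1 + lambda_2 = 1 (assumed split)
ce = F.cross_entropy(scores, gold)                # multi-class CE over K candidates
p = F.softmax(scores, dim=-1)                     # p_j
log_r = F.log_softmax(be_scores, dim=-1)          # log r_j, treated as fixed
kl = F.kl_div(log_r, p, reduction="batchmean")    # sum_j p_j log(p_j / r_j)
loss = lam1 * ce + lam2 * kl
loss.backward()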

3. Theoretical and Empirical Efficiency

CMC offers a favorable tradeoff between computational complexity and retrieval effectiveness:

| Approach | Typical Cost per Query | Empirical Latency (ZeSHEL) |
|----------|------------------------|----------------------------|
| BE | $O(\mathrm{MIPS})$ | 570 ms ($K=512$) |
| CE | $K \times O(p^2 d L_{\rm CE})$ | 260 ms ($K=64$) |
| CMC | $O((K+1)^2 d L_{\rm CMC})$ | 37 ms ($K=512$) |

  • Complexity: for moderate $K$ (e.g., $K=64$), CMC's joint self-attention spans $(K+1)^2 = 4{,}225$ position pairs, versus $p^2 = 25{,}600$ per candidate for typical CE input lengths ($p=160$); see the arithmetic sketch below.
  • Empirical throughput: CMC can process up to 16,000 candidates in a single batch, limited only by GPU memory, whereas CE scales poorly beyond a few dozen documents.
  • Latency: on ZeSHEL, BE+CMC ($K=512$) requires 607 ms ($\approx 1.07\times$ BE alone); running CE on CMC's top 16 takes 160 ms ($\approx 0.6\times$ CE at $K=64$).
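
A back-of-the-envelope comparison of per-query attention operations, assuming a 12-layer CE (e.g., BERT-base); the constants are illustrative rather than measured.

# Rough attention-op count per query (illustrative constants).
K, p = 64, 160                       # candidate count; typical CE input length in tokens
L_cmc, L_ce = 2, 12                  # shallow CMC vs. a 12-layer CE (assumption)
cmc_ops = (K + 1) ** 2 * L_cmc       # one joint sequence of K+1 vectors
ce_ops = K * p ** 2 * L_ce           # K independent p-token sequences
print(ce_ops / cmc_ops)              # ~2300x fewer attention ops for CMC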

4. Empirical Evaluation and Benchmarks

CMC achieves notable improvements across a spectrum of retrieval and ranking tasks:

4.1 ZeSHEL Retrieval (Recall@K)

  • BE@64 recall: 87.95%
  • BE+CMC@64 recall: 91.51% ($\Delta = +3.56$%-p)
  • BE+CMC@16 recall: 86.32% vs. BE@16: 81.52% ($\Delta = +4.80$%-p)

4.2 Downstream Accuracy

  • Wikipedia EL (AIDA/MSNBC/WNED-CWEB average accuracy):
    • CE: 80.2%
    • CMC: 80.9% ($+0.7$%-p)
  • DSTC7 dialogue ranking (MRR@10):
    • CE: 73.2
    • CMC: 76.5 ($+3.3$ MRR)

All improvements are statistically significant at $p < 0.01$.

4.3 Datasets and Hyperparameters

  • Entity Linking: AIDA-CoNLL, WNED-CWEB, MSNBC, ZeSHEL (10k–100k entities)
  • Passage Ranking: MS MARCO (8.8M passages)
  • Dialogue: DSTC7 Track 1 (100 candidates)

CMC configurations: BERT-base/large encoders (e.g., BLINK/CoCondenser); 2-layer, 12-head, 768-dimensional self-attention; maximum query lengths of 32/128 and maximum document lengths of 128/512, depending on the task; learning rate in $\{1\times10^{-5},\ 5\times10^{-6},\ 2\times10^{-6}\}$; batch size 4–8; 3–10 epochs; hard negatives mined from the BE.
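
These ranges can be collected into a single configuration sketch; the specific values chosen below are assumptions, since the paper sweeps them per dataset.

# Illustrative training configuration assembled from the reported ranges.
cmc_config = {
    "query_encoder": "bert-base-uncased",   # BLINK/CoCondenser weights per task
    "candidate_encoder": "bert-base-uncased",
    "attn_layers": 2,
    "attn_heads": 12,
    "hidden_dim": 768,
    "max_query_len": 32,                    # 128 for passage ranking
    "max_doc_len": 128,                     # 512 for passage ranking
    "learning_rate": 1e-5,                  # swept over {1e-5, 5e-6, 2e-6}
    "batch_size": 4,                        # reported range: 4-8
    "epochs": 3,                            # reported range: 3-10
    "negatives": "hard negatives mined from the BE",
}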

5. Engineering and Practical Deployment

For insertion into BE→CE pipelines:

  1. Offline corpus preparation: encode all items via $\mathrm{Enc}_{can}$ and store $\mathbf{h}_c^{\rm sent}$ for lookup (see the sketch after this list).
  2. Online per-query retrieval:
    • Encode $q$ using $\mathrm{Enc}_{qry}$.
    • BE retrieves the top-$K$ candidate IDs.
    • Retrieve the precomputed $\{\mathbf{h}^{\rm sent}_{c_{q,j}}\}$.
    • Concatenate with $\mathbf{h}^{\rm sent}_q$ to form $\mathbf{H}^0$ and process through CMC's Transformer.
    • Optionally invoke CE on CMC's top-$K'$ outputs.
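
The lookup table in step 1 can be as simple as a memory-mapped array indexed by candidate ID; in this sketch the file name, corpus size, and dtype are assumptions.

import numpy as np
import torch

NUM_ITEMS, DIM = 10_000_000, 768             # corpus size is an assumed example
emb_table = np.memmap("candidate_cls.f32", dtype=np.float32,
                      mode="r", shape=(NUM_ITEMS, DIM))

def fetch_candidates(cand_ids):
    # Online step: O(K) memory reads instead of K encoder forward passes.
    return torch.from_numpy(np.array(emb_table[cand_ids]))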

Parameter guidance:

  • $K \in [32, 128]$ is recommended; CMC remains robust up to $K = 512$.
  • Two self-attention layers with 12 heads are sufficient.
  • Use skip connections around each self-attention block for training stability.

6. Training and Inference Workflows

Training pseudocode:

for batch in query_batches:
    batch_loss = 0
    for q, gold_idx in batch:
        C = BE.topK(q)                        # K candidates: hard negatives + gold
        h_q = Enc_q(q)[CLS]                   # query summary vector, shape (d,)
        H_c = [Enc_c(c)[CLS] for c in C]      # candidate summaries; precomputed offline
        H0 = concat(h_q, H_c)                 # shape (K+1, d)
        H = TransformerEncoder(H0)            # L=2 shallow self-attention layers
        scores = [dot(H[0], H[j]) for j in 1..K]
        batch_loss += CrossEntropyLoss(scores, gold_idx) \
                      + lam * KL(softmax(scores) || softmax(BE_scores))
    backpropagate(batch_loss); update(parameters)

Inference pseudocode:

input: query q
C = BE.topK(q)                         # e.g., K = 512
h_q = Enc_q(q)[CLS]
H_c = lookup(C)                        # precomputed candidate embeddings
H0 = concat(h_q, H_c)                  # shape (K+1, d)
H = TransformerEncoder(H0)
scores = [dot(H[0], H[j]) for j in 1..K]
if final_rerank:
    top_Kprime = argsort(scores, descending=True)[:K']
    return CE.rerank(q, C[top_Kprime])
else:
    return C[argsort(scores, descending=True)]   # full reranked list

CMC thus unifies the scalability of BE retrieval with the contextual discrimination of cross-attention, offering a high-throughput comparative retriever suitable for modern large-vocabulary ranking tasks (Song et al., 2024).
