Comparative Retriever with CMC Framework
- Comparative Retriever (CR) is a retrieval reranking method that leverages the CMC framework to jointly compare multiple candidate embeddings using shallow self-attention.
- It sits between the bi-encoder and cross-encoder stages, improving recall by 3.5–4.8 percentage points on benchmarks such as ZeSHEL while adding minimal latency.
- The CMC framework employs a scalable Transformer architecture that enhances throughput and precision in tasks such as entity linking and dialogue ranking.
A Comparative Retriever (CR) implemented via the Comparing Multiple Candidates (CMC) framework is a reranking architecture designed to simultaneously improve retrieval recall and top-1 ranking accuracy within classic retrieve-and-rerank pipelines. CMC enables joint, contextualized comparison of a query with multiple candidate embeddings through shallow self-attention, bridging the speed–accuracy gap between bi-encoder (BE) and cross-encoder (CE) methods by modeling the intermediate candidate set with a scalable Transformer. Deployable as either an intermediate reranker or a final-stage scoring mechanism, CMC supports high-throughput settings and outperforms traditional pipelines on a range of retrieval tasks in both accuracy and end-to-end latency (Song et al., 2024).
1. Pipeline Integration and Function
Typical retrieval pipelines consist of a fast bi-encoder (BE) for large-scale candidate retrieval, followed by a slower, more expressive cross-encoder (CE) reranker for a limited candidate pool. The CMC layer is introduced between BE and CE:
- Stage 1: Bi-Encoder Retrieval. The query and all corpus items are encoded with the BE, and Maximum Inner Product Search (MIPS) returns the top-$K$ candidates as $\mathcal{C} = \{c_1, \dots, c_K\}$.
- Stage 2: CMC Reranking. CMC ingests the single query embedding and all $K$ candidate embeddings, jointly encoding them via shallow self-attention to compute contextualized similarity scores for all candidates in parallel.
- Stage 3: Optional Cross-Encoder Final Rerank. The highest-scoring $K'$ candidates ($K' \ll K$) from CMC may be forwarded to a CE for final, stricter reranking.
This yields an enhanced pipeline, BE → CMC → CE, where CMC supplies a "virtually enhanced" retrieval stage, producing improved recall (e.g., R@16 +4.8%-p and R@64 +3.5%-p on ZeSHEL) with a negligible latency increase (+7%). As a standalone reranker, CMC is both faster (≈11×) and more effective on specific downstream tasks (+0.7%-p on Wikipedia EL, +3.3 MRR on DSTC7 dialogue ranking) relative to CE (Song et al., 2024).
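A minimal sketch of this three-stage flow follows; the helper names (`be_index.search`, `cmc_model.score`, `cross_encoder.rerank`) are illustrative assumptions, not the paper's API:

```python
import numpy as np

def retrieve_and_rerank(query, be_index, cmc_model, cross_encoder, K=512, K_prime=16):
    # Stage 1: bi-encoder retrieval via MIPS over precomputed candidate embeddings.
    cand_ids, cand_embs = be_index.search(query, top_k=K)

    # Stage 2: CMC jointly scores all K candidates with shallow self-attention.
    cmc_scores = np.asarray(cmc_model.score(query, cand_embs))   # shape (K,)
    order = np.argsort(-cmc_scores)                              # best first

    # Stage 3 (optional): cross-encoder rerank of CMC's top-K' shortlist.
    shortlist = [cand_ids[i] for i in order[:K_prime]]
    return cross_encoder.rerank(query, shortlist)
```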
2. Mathematical Formulation
CMC is composed of several key formal elements:
2.1 Sentence-level Encodings
A query $q$ and each candidate $c_j$ ($j = 1, \dots, K$) are encoded by two separate encoders, $\mathrm{Enc}_q$ and $\mathrm{Enc}_c$, which extract single-vector summaries from the [CLS] position:
- Query: $\mathbf{h}_q = \mathrm{Enc}_q(q)_{[\mathrm{CLS}]} \in \mathbb{R}^d$
- Candidate: $\mathbf{h}_{c_j} = \mathrm{Enc}_c(c_j)_{[\mathrm{CLS}]} \in \mathbb{R}^d$
2.2 Multi-Candidate Self-Attention
Stack the query and candidate embeddings:

$$\mathbf{H}_0 = [\mathbf{h}_q;\, \mathbf{h}_{c_1};\, \dots;\, \mathbf{h}_{c_K}] \in \mathbb{R}^{(K+1) \times d}$$

Process $\mathbf{H}_0$ through $L$ Transformer encoder layers with no positional encoding (here $L = 2$, $12$ heads, $d = 768$):

$$\mathbf{H} = \mathrm{TransformerEncoder}(\mathbf{H}_0)$$

with $\mathbf{H} = [\tilde{\mathbf{h}}_q;\, \tilde{\mathbf{h}}_{c_1};\, \dots;\, \tilde{\mathbf{h}}_{c_K}]$.
2.3 Scoring & Selection
Score each contextualized candidate against the contextualized query:

$$s_j = \tilde{\mathbf{h}}_q \cdot \tilde{\mathbf{h}}_{c_j}, \quad j = 1, \dots, K$$

Select the top-1 candidate as $\hat{c} = \arg\max_j s_j$; for further reranking, pass the top-$K'$ candidates by $s_j$ downstream.
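A minimal PyTorch sketch of the joint encoding and scoring steps above (class and argument names are illustrative; the 2-layer, 12-head, 768-dimensional settings follow the configuration reported in Section 4.3):

```python
import torch
import torch.nn as nn

class CMCScorer(nn.Module):
    """Shallow self-attention over [h_q; h_c1; ...; h_cK], then dot-product scoring."""
    def __init__(self, d_model=768, n_heads=12, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        # No positional encoding: the order of candidates carries no information.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, h_q, h_c):
        # h_q: (B, d) query embeddings; h_c: (B, K, d) precomputed candidate embeddings.
        H0 = torch.cat([h_q.unsqueeze(1), h_c], dim=1)            # (B, K+1, d)
        H = self.encoder(H0)                                      # contextualized embeddings
        # s_j = contextualized query . contextualized candidate j
        scores = torch.einsum('bd,bkd->bk', H[:, 0], H[:, 1:])    # (B, K)
        return scores
```

Top-1 selection is then `scores.argmax(dim=-1)`, and a top-$K'$ shortlist for an optional CE stage is `scores.topk(K_prime, dim=-1).indices`.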
2.4 Training Loss
Employ a multi-class cross-entropy over the $K$ candidates:

$$\mathcal{L}_{\mathrm{CE}} = -\log \frac{\exp(s_{j^*})}{\sum_{j=1}^{K} \exp(s_j)}$$

where $j^*$ indexes the gold candidate. Optionally, include a KL regularizer that biases the CMC score distribution $p_{\mathrm{CMC}} = \mathrm{softmax}(s)$ toward the BE's retrieval distribution $p_{\mathrm{BE}}$:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathrm{KL}\big(p_{\mathrm{CMC}} \,\|\, p_{\mathrm{BE}}\big)$$
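Under these definitions, the objective can be sketched as follows (a minimal sketch; the regularizer weight `lam` and the availability of BE probabilities `p_be` are assumptions about the training setup):

```python
import torch
import torch.nn.functional as F

def cmc_loss(scores, gold_idx, p_be=None, lam=0.1):
    # scores: (B, K) CMC similarity scores; gold_idx: (B,) index of the gold candidate.
    loss = F.cross_entropy(scores, gold_idx)
    if p_be is not None:
        # Optional KL(p_CMC || p_BE) term biasing CMC toward the BE retrieval distribution.
        p_cmc = F.softmax(scores, dim=-1)
        log_p_cmc = F.log_softmax(scores, dim=-1)
        kl = (p_cmc * (log_p_cmc - p_be.clamp_min(1e-9).log())).sum(dim=-1).mean()
        loss = loss + lam * kl
    return loss
```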
3. Theoretical and Empirical Efficiency
CMC offers a favorable tradeoff between computational complexity and retrieval effectiveness:
| Approach | Typical Cost per Query | Latency (Empirical, ZeSHEL) |
|---|---|---|
| BE | One query encoding + MIPS over the candidate index | 570 ms (K=512) |
| CE | $K$ joint encodings of concatenated (query, candidate) token sequences | 260 ms (K=64) |
| CMC | One query encoding + shallow self-attention over $K{+}1$ precomputed embeddings | 37 ms (K=512) |
- Complexity: For moderate $K$ (e.g., $K = 512$), CMC's shallow self-attention operates over $K{+}1$ single-vector embeddings (roughly $O(K^2 d)$ per layer), versus CE's $K$ full joint encodings over token sequences whose lengths run into the hundreds, so CMC avoids all token-level cross-attention at inference time.
- Empirical Throughput: CMC can process up to 16,000 candidates in a single batch (limited by GPU memory), unlike CE, which scales poorly beyond a few dozen candidates.
- Latency: On ZeSHEL, BE+CMC ($K = 512$) requires 607 ms (vs. 570 ms for BE alone); a CMC-filtered top-16 shortlist followed by CE takes 160 ms (vs. 260 ms for CE over 64 candidates).
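As a quick check, the end-to-end figure decomposes from the numbers quoted above:

$$570\ \text{ms (BE)} + 37\ \text{ms (CMC, } K{=}512\text{)} = 607\ \text{ms}, \qquad \frac{607}{570} \approx 1.065,$$

i.e., roughly a 7% overhead relative to BE retrieval alone.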
4. Empirical Evaluation and Benchmarks
CMC achieves notable improvements across a spectrum of retrieval and ranking tasks:
4.1 ZeSHEL Retrieval (Recall@K)
- BE@64 Recall: 87.95%
- BE+CMC@64 Recall: 91.51% (+3.56%-p)
- BE+CMC@16 Recall: 86.32% vs. BE@16: 81.52% (+4.80%-p)
4.2 Downstream Accuracy
- Wikipedia EL (AIDA/MSNBC/WNED-CWEB avg. accuracy):
  - CE: 80.2%
  - CMC: 80.9% (+0.7%-p)
- DSTC7 Dialogue (MRR@10):
  - CE: 73.2
  - CMC: 76.5 (+3.3 MRR)
All reported improvements are statistically significant (Song et al., 2024).
4.3 Datasets and Hyperparameters
- Entity Linking: AIDA-CoNLL, WNED-CWEB, MSNBC, ZeSHEL (10k–100k entities)
- Passage Ranking: MS MARCO (8.8M passages)
- Dialogue: DSTC7 Track 1 (100 candidates)
CMC configurations: BERT-base/large encoders (e.g., BLINK/CoCondenser), 2-layer/12-head/768-dim self-attention, max query lengths 32/128, max document lengths 128/512, task-specific learning rates, batch sizes 4–8, 3–10 epochs, with hard negatives mined from the BE.
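For reference, these settings can be consolidated into a single configuration sketch (field names are illustrative; the encoder checkpoints and learning rate are left as placeholders since only model families and ranges are reported above):

```python
cmc_config = {
    "query_encoder": "<BERT-base/large, e.g. a BLINK or CoCondenser checkpoint>",
    "candidate_encoder": "<same family as the query encoder>",
    "attn_layers": 2,
    "attn_heads": 12,
    "hidden_dim": 768,
    "max_query_len": 32,        # 128 for passage ranking
    "max_doc_len": 128,         # 512 for passage ranking
    "batch_size": 8,            # 4-8 reported
    "epochs": 5,                # 3-10 reported
    "learning_rate": None,      # tuned per task; exact range not specified here
    "negatives": "hard negatives mined from the bi-encoder",
}
```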
5. Engineering and Practical Deployment
For insertion into BE→CE pipelines:
- Offline corpus preparation: Encode all corpus items via $\mathrm{Enc}_c$ and store the resulting embeddings $\mathbf{h}_c$ for lookup (see the sketch after this list).
- Online per-query retrieval:
  - Encode $q$ using $\mathrm{Enc}_q$ to obtain $\mathbf{h}_q$.
  - BE retrieves the top-$K$ candidate IDs via MIPS.
  - Retrieve the precomputed candidate embeddings $\mathbf{h}_{c_1}, \dots, \mathbf{h}_{c_K}$.
  - Concatenate them with $\mathbf{h}_q$ to form $\mathbf{H}_0$ and process through CMC's Transformer.
  - Optionally invoke CE on CMC's top-$K'$ outputs.
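The offline step referenced in the first bullet might look like the following sketch (batching, storage format, and the `enc_c` interface are assumptions, not the paper's tooling):

```python
import numpy as np
import torch

@torch.no_grad()
def precompute_candidate_embeddings(enc_c, corpus, out_path, batch_size=256):
    """Encode every corpus item once with Enc_c and store its [CLS] vector for lookup."""
    chunks = []
    for i in range(0, len(corpus), batch_size):
        batch = corpus[i:i + batch_size]
        h_c = enc_c(batch)                      # assumed to return (B, d) [CLS] embeddings
        chunks.append(h_c.cpu().numpy())
    matrix = np.concatenate(chunks, axis=0)     # (num_items, d); row index = candidate ID
    np.save(out_path, matrix)
    return matrix
```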
Parameter guidance:
- $K$ in the low hundreds (e.g., $K = 512$) is recommended; CMC remains robust up to roughly 16,000 candidates per batch (GPU-memory bound).
- 2 self-attention layers with 12 heads and $d = 768$ are sufficient.
- Use skip connections per self-attention block for training stability.
6. Training and Inference Workflows
Training pseudocode:
```
for each batch of queries {q_i}:
    for each q_i:
        C_i = BE.topK(q_i)                         # indices of hard negatives + gold
        h_q = Enc_q(q_i)[CLS]
        H_c = [Enc_c(c)[CLS] for c in C_i]         # can be precomputed offline
        H0  = concat(h_q, H_c)                     # shape = (K+1, d)
        H   = TransformerEncoder(H0)               # L = 2 layers, no positional encoding
        scores = [dot(H[0], H[j]) for j in 1..K]
        loss = CrossEntropyLoss(scores, gold_idx) \
               + λ * KL(softmax(scores) ∥ BE_scores)
    backpropagate & update parameters
```
Inference pseudocode:
```
input: query q
candidates = BE.topK(q)                    # e.g., K = 512
h_q = Enc_q(q)[CLS]
H_c = lookup(candidates)                   # precomputed candidate embeddings
H0  = concat(h_q, H_c)
H   = TransformerEncoder(H0)
scores = [dot(H[0], H[j]) for j in 1..K]
if final rerank:
    topKprime = argsort(scores)[-K'..]     # highest-scoring K' candidates
    return CE.rerank(q, topKprime)
else:
    return argsort(scores)                 # new ranked candidate list
```
CMC thus unifies the scalability of BE retrieval with the contextual discrimination of cross-attention, offering a high-throughput comparative retriever suitable for modern large-vocabulary ranking tasks (Song et al., 2024).