Bi-Encoder Cascades: Efficient Multi-Stage Retrieval
- Bi-Encoder Cascades are multi-stage frameworks that use a series of encoders with increasing cost to filter and refine retrieval candidates.
- They start with a fast, low-cost encoder to precompute static representations, then selectively apply stronger models to a narrowed candidate set.
- Experiments demonstrate up to 6× lifetime cost reduction and halved cold-start latency while preserving state-of-the-art Recall@k performance.
A bi-encoder cascade is a multi-stage retrieval or inference framework that sequentially combines several bi-encoder models, each with increasing computational cost and representational power, to minimize total system cost while preserving end-task accuracy. In contrast to prevalent cross-encoder-based cascades, which require expensive joint pairwise computation at query time, bi-encoder cascades leverage the ability to precompute static representations for one modality (e.g., images or acoustic frames), allowing selective refinement over a constrained subset of candidates. The small-world cascade regime, introduced in the context of text–image retrieval, formalizes scenarios where a limited fraction of the corpus repeatedly attracts most of the queries, thus concentrating expensive computation on a small, dynamically identified subset of items. Cascade architectures are now studied for retrieval, automatic speech recognition (ASR), and other multimodal sequence mapping tasks (Hönig et al., 2023; Narayanan et al., 2020).
1. Formalization and Motivation of Bi-Encoder Cascades
Classically, a bi-encoder (BE) system deploys two independent encoders: an image encoder $I$ and a text encoder $T$. At build time, all images $d \in D$ are transformed to cached embeddings $I(d)$. At query time, a query $q$ is embedded as $T(q)$, and retrieval reduces to ranking all images by their cosine similarity $s(I(d), T(q))$. This approach decouples image and text computation, but strong encoders are computationally intensive to run over corpora of realistic scale.
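For concreteness, a minimal NumPy sketch of this baseline, with random vectors standing in for real encoder outputs (the array `cached` plays the role of the precomputed $I(d)$ table and `query` the role of $T(q)$; both are stand-ins, not actual model calls):

```python
import numpy as np

def cosine_scores(image_embs: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity s(I(d), T(q)) between all cached image embeddings and one query."""
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return imgs @ q

# Build time: cache I(d) for every image d in D (random stand-ins here).
rng = np.random.default_rng(0)
cached = rng.normal(size=(10_000, 512))   # |D| = 10,000 images, 512-d embeddings

# Query time: embed q as T(q), then rank all images by s(I(d), T(q)).
query = rng.normal(size=512)
top_k = np.argsort(-cosine_scores(cached, query))[:10]
```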
Cascades address this challenge by combining a fast, cheap BE (e.g., $I_{\text{small}}$) with a slower but more accurate BE ($I_{\text{big}}$). The cheap encoder provides a coarse ranking and selects a top-$m$ candidate set per query; the expensive encoder reranks only these candidates. If many images rarely or never appear near the top of any query under $I_{\text{small}}$, substantial computation is saved, since $I_{\text{big}}$ need only process a small sub-corpus on demand (Hönig et al., 2023).
Unlike cascades with final cross-encoder scoring, which merely compare favorably against slow cross-encoder baselines, cascades of bi-encoders offer cost reductions even relative to a single strong bi-encoder model, as long as the small-world property approximately holds.
2. The Small-World Search Regime
The small-world regime is defined by the property that, under a given retrieval model, the union of top-$m$ candidate sets selected across all queries,

$$D_{\text{hot}} = \bigcup_{q \in Q} \operatorname{Top}\text{-}m(q),$$

is much smaller than the full corpus $D$. That is, for some $\alpha \ll 1$,

$$|D_{\text{hot}}| \leq \alpha \, |D|.$$
This assumption encodes the observation that, in practice, queries cluster their preferences over particular “hub” items, which greatly limits the set of items receiving the costlier encoder pass.
A plausible implication is that the degree of cost saving depends essentially on the effective $\alpha$ for the application domain. Homogeneous or highly popular content domains yield the largest benefit; adversarial or uniformly random query regimes reduce savings.
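Given a matrix of cheap-encoder similarity scores, the effective $\alpha$ can be estimated directly as the size of the union of per-query top-$m$ sets relative to the corpus; a minimal sketch (random scores as stand-ins, so the measured $\alpha$ here is large):

```python
import numpy as np

def effective_alpha(scores: np.ndarray, m: int) -> float:
    """scores: (num_queries, |D|) similarities under the cheap encoder.
    Returns |D_hot| / |D|, where D_hot is the union of per-query top-m sets."""
    top_m = np.argpartition(-scores, m, axis=1)[:, :m]  # per-query Top-m indices
    d_hot = np.unique(top_m)                            # D_hot = union over all queries
    return d_hot.size / scores.shape[1]

rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 10_000))  # 500 queries over a 10k-item corpus
# Large (~0.8) for uniformly random queries; small-world workloads with shared
# "hub" items would drive this far below 1.
print(effective_alpha(scores, m=32))
```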
3. Cascade Construction and Algorithmic Details
A general $r$-level cascade is constructed as a chain of progressively stronger image encoders $I_1, \dots, I_r$ with increasing per-image inference costs $c_1 < c_2 < \cdots < c_r$. At each stage $j$, a cut-off $m_j$ is chosen ($|D| \geq m_1 > m_2 > \cdots > m_{r-1} \geq k$), interleaving stages of coarse filtering and selective refinement.
The canonical algorithm can be written as:
```
# Build time: encode every image with the cheapest encoder I_1
for d in D:
    cache_1[d] = I_1(d)

# Query time, given query q
Top = top-m_1 images in D, ranked by s(cache_1[d], T(q))
for j in 2, ..., r:
    for d in Top:                        # at most m_{j-1} candidates
        if cache_j[d] is empty:
            cache_j[d] = I_j(d)          # encode on demand, memoize
    Top = top-m_j (or top-k if j == r) images in Top, ranked by s(cache_j[d], T(q))
return Top                               # the final top-k
```
Key hyper-parameters are the depth $r$ of the cascade and the sequence of cut-offs $m_1, \dots, m_{r-1}$, controlling the recall/cost trade-off. Per-query costs comprise a cheap-similarity scan over $D$ for initial filtering (often approximated with ANN) and up to $\sum_{j=2}^{r} m_{j-1}\, c_j$ for successive encoder passes on a cold cache (Hönig et al., 2023).
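A runnable NumPy sketch of the algorithm above. The encoders are random projections standing in for real models of increasing cost, and the names (`cache_1`, `cascade_query`, the cut-offs) are illustrative; a real deployment would plug in CLIP-style encoders and an ANN index for the first stage:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 64, 5_000
corpus = rng.normal(size=(N, DIM))       # raw image inputs (stand-ins)

# Stand-in encoders I_1 (cheap) .. I_3 (strong): random projections.
encoders = [lambda x, W=rng.normal(size=(DIM, DIM)): x @ W for _ in range(3)]

def s(a, b):
    """Cosine similarity; `a` is a matrix or vector of embeddings, `b` one query vector."""
    return (a @ b) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-9)

cache_1 = encoders[0](corpus)            # build-time pass with the cheapest encoder
caches = [dict() for _ in encoders]      # lazily filled per-stage caches (stage 1 unused)

def cascade_query(query_emb, cutoffs=(256, 32), k=10):
    # Stage 1: coarse filtering over the whole corpus via the precomputed cache.
    top = np.argsort(-s(cache_1, query_emb))[:cutoffs[0]]
    # Stages 2..r: encode survivors on demand, memoize, rerank.
    for j in range(1, len(encoders)):
        for d in top:
            if int(d) not in caches[j]:
                caches[j][int(d)] = encoders[j](corpus[d])
        scores = np.array([s(caches[j][int(d)], query_emb) for d in top])
        m_next = cutoffs[j] if j < len(encoders) - 1 else k
        top = top[np.argsort(-scores)[:m_next]]
    return top                           # final top-k indices

print(cascade_query(rng.normal(size=DIM)))
```

Repeated queries amortize the refinement work: once an item's stage-$j$ embedding is cached, later queries reuse it, which is exactly where the small-world saving accrues.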
4. Lifetime Cost Model
Let $c_j$ denote the per-image inference cost at level $j$. For a single-encoder BE baseline built on the strongest model, all images are encoded once at build time: $C_{\text{BE}} = |D|\, c_r$. Under the $\alpha$-small-world model with an $r$-level cascade,

$$C_{\text{casc}} = |D|\, c_1 + \alpha\, |D| \sum_{j=2}^{r} c_j.$$

This assumes all images are initially encoded by the cheapest encoder and only a fraction $\alpha$ reach stronger encoders over the system lifetime. The cost saving factor is then

$$\sigma = \frac{C_{\text{BE}}}{C_{\text{casc}}} = \frac{c_r}{c_1 + \alpha \sum_{j=2}^{r} c_j}.$$
For $r = 2$ (2-level), $\sigma = \dfrac{c_2}{c_1 + \alpha c_2}$, which approaches $1/\alpha$ when $c_1 \ll \alpha c_2$. Deeper cascades can further reduce early-query latency (cold cache): a 3-level cascade replaces the 2-level per-query cold-cache cost $m_1 c_3$ with $m_1 c_2 + m_2 c_3$, a speedup of roughly $m_1 / m_2$ if $m_1 c_2 \ll m_2 c_3$.
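A worked instance of the saving factor, with illustrative (not measured) per-image costs:

```python
# Illustrative per-image costs c_1 < c_2 < c_3 (arbitrary units, not from the paper).
c = [1.0, 5.0, 50.0]
alpha = 0.05          # fraction of the corpus ever reaching the stronger encoders
D = 1_000_000         # corpus size |D|

C_BE   = D * c[-1]                            # baseline: strongest encoder over all of D
C_casc = D * c[0] + alpha * D * sum(c[1:])    # cheap pass over D + strong passes over alpha*|D|
print(f"sigma = {C_BE / C_casc:.1f}x")        # -> sigma = 13.3x
```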
5. Experimental Setup and Results
Evaluation used Flickr30k (1,000 test images) and MSCOCO val (5,000 images), reporting Recall@$k$ for $k \in \{1, 5, 10\}$, lifetime cost reduction, and early-query (cold-cache) latency reduction for three-stage cascades. Model backbones included OpenCLIP ViT-B/16, ViT-L/14, and ViT-g/14; OpenCLIP ConvNeXt-B/L/XXL; and BLIP-B and BLIP-L. Per-image cost was computed as #MACs via PyTorch-OpCounter (Hönig et al., 2023).
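The MAC-counting step can be reproduced with the PyTorch-OpCounter package (`thop`); the sketch below profiles a small torchvision model as a placeholder, since the actual backbones in the study were CLIP/BLIP encoders:

```python
import torch
from thop import profile          # PyTorch-OpCounter
from torchvision.models import resnet18

model = resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)             # one image at the model's input resolution
macs, params = profile(model, inputs=(dummy,))  # per-image MACs and parameter count
print(f"{macs / 1e9:.2f} GMACs, {params / 1e6:.1f}M params")
```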
Key quantitative results from Flickr30k:
| Cascade | Recall@1 | Recall@5 | Recall@10 | Lifetime cost reduction | ΔRecall@1 vs XXL |
|---|---|---|---|---|---|
| XXL baseline | 75.0 | 93.8 | 96.4 | 1.0× | 0 |
| B → XXL | 76.4 | 93.7 | 96.3 | 5.0× | +1.4 |
| B → L → XXL | 75.4 | 93.5 | 96.7 | 4.5× | +0.4 |
| L → XXL | 75.0 | 93.4 | 96.8 | 3.1× | 0 |
| L baseline | 73.8 | 92.2 | 96.1 | 4.4× | –1.2 |
| B baseline | 69.2 | 89.0 | 94.1 | 9.9× | –5.8 |
All cascades matched the strong XXL baseline’s Recall@5 and Recall@10 within ±0.5 points and matched or exceeded its Recall@1; only the cheap single-level baselines fell clearly short. Two-level cascades achieved up to 6× lifetime cost reduction with no loss in Recall@k, and three-level cascades halved cold-start latency compared to two-level alternatives.
Similar patterns were observed for the ViT and BLIP backbones and on MSCOCO.
6. Insights, System Limitations, and Extension Proposals
Bi-encoder cascades capitalize on the empirical “small-world” effect in retrieval workloads, efficiently focusing expensive computation where it is recurrently needed. Deeper cascades can substantially reduce initial query latency by replacing many strong-encoder passes with cheaper ones, without sacrificing recall. Limitations include the assumption of stationary query distributions and fixed $\alpha$, the necessity of tuning stage cut-offs ($m_j$), increased memory for per-stage caches, and a primary focus on recall rather than other metrics (e.g., end-to-end latency under high QPS, precision@k).
Proposed extensions include adaptive per-query or learned stopping rules (e.g., halting at different cascade depths dynamically), integration with classical ANN indices or cross-encoders (“heterogeneous” cascades), application to video retrieval and multilingual tasks, and joint optimization with quantization or distillation methods (Hönig et al., 2023).
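As one hypothetical instantiation of an adaptive stopping rule, a cascade could halt before deeper stages when the current stage’s score margin already indicates a confident ranking; the margin test and threshold below are illustrative, not from the paper:

```python
import numpy as np

def confident_enough(scores: np.ndarray, margin: float = 0.15) -> bool:
    """Hypothetical early-exit test: stop cascading if the top-1 score
    clearly separates from the runner-up under the current encoder."""
    top2 = np.sort(scores)[-2:]
    return (top2[1] - top2[0]) >= margin

# Usage inside a cascade loop: skip stronger stages when the cheap stage is decisive.
stage_scores = np.array([0.91, 0.55, 0.40, 0.38])
if confident_enough(stage_scores):
    pass  # return the current ranking; no stronger encoder pass needed
```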
7. Comparative Context: Cascaded Encoders beyond Retrieval
A related bi-encoder cascade motif has been applied in end-to-end ASR to unify streaming and non-streaming modes (Narayanan et al., 2020). Here, a causal “streaming” encoder processes input features online, serving latency-sensitive applications; a non-causal “non-streaming” encoder further refines the streaming encoder’s outputs for higher quality at greater compute and latency. Both encoders share a single RNN-T decoder. Empirically, streaming WER matched or improved over baselines, while non-streaming operation improved WER by 10–27% relative across test sets (e.g., VS 5.1%, T-AB 3.3%), showing that cascading also improves two-pass sequence-to-sequence models (Narayanan et al., 2020).
This suggests that bi-encoder cascades are broadly applicable as a resource-efficient design for both multimodal retrieval and sequence-mapping (generation) tasks.
References:
- Hönig et al. (2023). "Bi-Encoder Cascades for Efficient Image Search."
- Narayanan et al. (2020). "Cascaded encoders for unifying streaming and non-streaming ASR."