Bi-Encoder Cascades: Efficient Multi-Stage Retrieval
- Bi-Encoder Cascades are multi-stage frameworks that use a series of encoders with increasing cost to filter and refine retrieval candidates.
- They start with a fast, low-cost encoder to precompute static representations, then selectively apply stronger models to a narrowed candidate set.
- Experiments demonstrate up to 6× lifetime cost reduction and halved cold-start latency while preserving state-of-the-art Recall@k performance.
A bi-encoder cascade is a multi-stage retrieval or inference framework that sequentially combines several bi-encoder models, each with increasing computational cost and representational power, to minimize total system cost while preserving end-task accuracy. In contrast to prevalent cross-encoder-based cascades, which require expensive joint pairwise computation at query time, bi-encoder cascades leverage the ability to precompute static representations for one modality (e.g., images or acoustic frames), allowing selective refinement over a constrained subset of candidates. The small-world cascade regime, introduced in the context of text–image retrieval, formalizes scenarios where a limited fraction of the corpus repeatedly attracts most of the queries, thus concentrating expensive computation on a small, dynamically identified subset of items. Cascade architectures are now studied for retrieval, automatic speech recognition (ASR), and other multimodal sequence mapping tasks (Hönig et al., 2023; Narayanan et al., 2020).
1. Formalization and Motivation of Bi-Encoder Cascades
Classically, a bi-encoder (BE) system deploys two independent encoders: an image encoder $I$ and a text encoder $T$. At build time, all images $d \in D$ are transformed to cached embeddings $I(d)$. At query time, a query $q$ is embedded as $T(q)$, and retrieval reduces to ranking all images by their cosine similarity $s(I(d), T(q))$. This approach decouples image and text computation, but strong encoders are computationally intensive to run over corpora of realistic scale.
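For concreteness, a minimal NumPy sketch of this baseline, with random vectors standing in for real encoder outputs (the array `cached` plays the role of the precomputed $I(d)$ table and `query` the role of $T(q)$; both are stand-ins, not actual model calls):

```python
import numpy as np

def cosine_scores(image_embs: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity s(I(d), T(q)) between all cached image embeddings and one query."""
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return imgs @ q

# Build time: cache I(d) for every image d in D (random stand-ins here).
rng = np.random.default_rng(0)
cached = rng.normal(size=(10_000, 512))   # |D| = 10,000 images, 512-d embeddings

# Query time: embed q as T(q), then rank all images by s(I(d), T(q)).
query = rng.normal(size=512)
top_k = np.argsort(-cosine_scores(cached, query))[:10]
```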
Cascades address this challenge by combining a fast, cheap BE (e.g., $I_{\text{small}}$) with a slower but more accurate BE ($I_{\text{big}}$). The cheap encoder provides a coarse ranking and selects a top-$m$ candidate set per query; the expensive encoder reranks only these candidates. If many images rarely or never appear near the top of any query under $I_{\text{small}}$, substantial computation is saved, since $I_{\text{big}}$ need only process a small sub-corpus on demand (Hönig et al., 2023).
Unlike cascades with final cross-encoder scoring, which merely compare favorably against slow cross-encoder baselines, cascades of bi-encoders offer cost reductions even relative to a single strong bi-encoder model, as long as the small-world property approximately holds.
2. The Small-World Search Regime
The small-world regime is defined by the property that, under a given retrieval model, the union of top-$m$ candidate sets selected across all queries,

$$D_{\text{hot}} = \bigcup_{q \in Q} \operatorname{Top}\text{-}m(q),$$

is much smaller than the full corpus $D$. That is, for some $\alpha \ll 1$,

$$|D_{\text{hot}}| \leq \alpha \, |D|.$$
This assumption encodes the observation that, in practice, queries cluster their preferences over particular “hub” items, which greatly limits the set of items receiving the costlier encoder pass.
A plausible implication is that the degree of cost saving depends essentially on the effective $\alpha$ for the application domain. Homogeneous or highly popular content domains yield the largest benefit; adversarial or uniformly random query regimes reduce savings.
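Given a matrix of cheap-encoder similarity scores, the effective $\alpha$ can be estimated directly as the size of the union of per-query top-$m$ sets relative to the corpus; a minimal sketch (random scores as stand-ins, so the measured $\alpha$ here is large):

```python
import numpy as np

def effective_alpha(scores: np.ndarray, m: int) -> float:
    """scores: (num_queries, |D|) similarities under the cheap encoder.
    Returns |D_hot| / |D|, where D_hot is the union of per-query top-m sets."""
    top_m = np.argpartition(-scores, m, axis=1)[:, :m]  # per-query Top-m indices
    d_hot = np.unique(top_m)                            # D_hot = union over all queries
    return d_hot.size / scores.shape[1]

rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 10_000))  # 500 queries over a 10k-item corpus
# Large (~0.8) for uniformly random queries; small-world workloads with shared
# "hub" items would drive this far below 1.
print(effective_alpha(scores, m=32))
```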
3. Cascade Construction and Algorithmic Details
A general $r$-level cascade is constructed as a chain of progressively stronger image encoders $I_1, \dots, I_r$ with increasing per-image inference costs $c_1 < c_2 < \cdots < c_r$. At each stage $j$, a cut-off $m_j$ is chosen ($|D| \geq m_1 > m_2 > \cdots > m_{r-1} \geq k$), interleaving stages of coarse filtering and selective refinement.
The canonical algorithm can be written as:
```
# Build time: encode every image with the cheapest encoder I_1
for d in D:
    cache_1[d] = I_1(d)

# Query time, given query q
Top = top-m_1 images in D, ranked by s(cache_1[d], T(q))
for j in 2, ..., r:
    for d in Top:                        # at most m_{j-1} candidates
        if cache_j[d] is empty:
            cache_j[d] = I_j(d)          # encode on demand, memoize
    Top = top-m_j (or top-k if j == r) images in Top, ranked by s(cache_j[d], T(q))
return Top                               # the final top-k
```
Key hyper-parameters are the depth $r$ of the cascade and the sequence of cut-offs $m_1, \dots, m_{r-1}$, controlling the recall/cost trade-off. Per-query costs comprise a cheap-similarity scan over $D$ for initial filtering (often approximated with ANN) and up to $\sum_{j=2}^{r} m_{j-1}\, c_j$ for successive encoder passes on a cold cache (Hönig et al., 2023).
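A runnable NumPy sketch of the algorithm above. The encoders are random projections standing in for real models of increasing cost, and the names (`cache_1`, `cascade_query`, the cut-offs) are illustrative; a real deployment would plug in CLIP-style encoders and an ANN index for the first stage:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 64, 5_000
corpus = rng.normal(size=(N, DIM))       # raw image inputs (stand-ins)

# Stand-in encoders I_1 (cheap) .. I_3 (strong): random projections.
encoders = [lambda x, W=rng.normal(size=(DIM, DIM)): x @ W for _ in range(3)]

def s(a, b):
    """Cosine similarity; `a` is a matrix or vector of embeddings, `b` one query vector."""
    return (a @ b) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-9)

cache_1 = encoders[0](corpus)            # build-time pass with the cheapest encoder
caches = [dict() for _ in encoders]      # lazily filled per-stage caches (stage 1 unused)

def cascade_query(query_emb, cutoffs=(256, 32), k=10):
    # Stage 1: coarse filtering over the whole corpus via the precomputed cache.
    top = np.argsort(-s(cache_1, query_emb))[:cutoffs[0]]
    # Stages 2..r: encode survivors on demand, memoize, rerank.
    for j in range(1, len(encoders)):
        for d in top:
            if int(d) not in caches[j]:
                caches[j][int(d)] = encoders[j](corpus[d])
        scores = np.array([s(caches[j][int(d)], query_emb) for d in top])
        m_next = cutoffs[j] if j < len(encoders) - 1 else k
        top = top[np.argsort(-scores)[:m_next]]
    return top                           # final top-k indices

print(cascade_query(rng.normal(size=DIM)))
```

Repeated queries amortize the refinement work: once an item's stage-$j$ embedding is cached, later queries reuse it, which is exactly where the small-world saving accrues.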
4. Lifetime Cost Model
Let $c_j$ denote the per-image inference cost at level $j$. For a single-encoder BE baseline built on the strongest model, all images are encoded once at build time: $C_{\text{BE}} = |D|\, c_r$. Under the $\alpha$-small-world model with an $r$-level cascade,

$$C_{\text{casc}} = |D|\, c_1 + \alpha\, |D| \sum_{j=2}^{r} c_j.$$

This assumes all images are initially encoded by the cheapest encoder and only a fraction $\alpha$ reach stronger encoders over the system lifetime. The cost saving factor is then

$$\sigma = \frac{C_{\text{BE}}}{C_{\text{casc}}} = \frac{c_r}{c_1 + \alpha \sum_{j=2}^{r} c_j}.$$
For $r = 2$ (2-level), $\sigma = \dfrac{c_2}{c_1 + \alpha c_2}$, which approaches $1/\alpha$ when $c_1 \ll \alpha c_2$. Deeper cascades can further reduce early-query latency (cold cache): a 3-level cascade replaces the 2-level per-query cold-cache cost $m_1 c_3$ with $m_1 c_2 + m_2 c_3$, a speedup of roughly $m_1 / m_2$ if $m_1 c_2 \ll m_2 c_3$.
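A worked instance of the saving factor, with illustrative (not measured) per-image costs:

```python
# Illustrative per-image costs c_1 < c_2 < c_3 (arbitrary units, not from the paper).
c = [1.0, 5.0, 50.0]
alpha = 0.05          # fraction of the corpus ever reaching the stronger encoders
D = 1_000_000         # corpus size |D|

C_BE   = D * c[-1]                            # baseline: strongest encoder over all of D
C_casc = D * c[0] + alpha * D * sum(c[1:])    # cheap pass over D + strong passes over alpha*|D|
print(f"sigma = {C_BE / C_casc:.1f}x")        # -> sigma = 13.3x
```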
5. Experimental Setup and Results
Evaluation used Flickr30k (1,000 test images) and MSCOCO val (5,000 images), reporting Recall@$k$ for $k \in \{1, 5, 10\}$, lifetime cost reduction, and early-query (cold-cache) latency reduction for three-stage cascades. Model backbones included OpenCLIP ViT-B/16, ViT-L/14, and ViT-g/14; OpenCLIP ConvNeXt-B/L/XXL; and BLIP-B and BLIP-L. Per-image cost was computed as #MACs via PyTorch-OpCounter (Hönig et al., 2023).
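The MAC-counting step can be reproduced with the PyTorch-OpCounter package (`thop`); the sketch below profiles a small torchvision model as a placeholder, since the actual backbones in the study were CLIP/BLIP encoders:

```python
import torch
from thop import profile          # PyTorch-OpCounter
from torchvision.models import resnet18

model = resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)             # one image at the model's input resolution
macs, params = profile(model, inputs=(dummy,))  # per-image MACs and parameter count
print(f"{macs / 1e9:.2f} GMACs, {params / 1e6:.1f}M params")
```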
Key quantitative results from Flickr30k:
| Cascade | Recall@1 | Recall@5 | Recall@10 | Lifetime cost reduction | ΔRecall@1 vs XXL |
|---|---|---|---|---|---|
| XXL baseline | 75.0 | 93.8 | 96.4 | 1.0× | 0 |
| B → XXL | 76.4 | 93.7 | 96.3 | 5.0× | +1.4 |
| B → L → XXL | 75.4 | 93.5 | 96.7 | 4.5× | +0.4 |
| L → XXL | 75.0 | 93.4 | 96.8 | 3.1× | 0 |
| L baseline | 73.8 | 92.2 | 96.1 | 4.4× | –1.2 |
| B baseline | 69.2 | 89.0 | 94.1 | 9.9× | –5.8 |
All cascades matched the strong XXL baseline’s Recall@5 and Recall@10 within ±0.5 points and matched or exceeded its Recall@1; only the cheap single-level baselines fell clearly short. Two-level cascades achieved up to 6× lifetime cost reduction with no loss in Recall@k, and three-level cascades halved cold-start latency compared to two-level alternatives.
Similar patterns were observed for the ViT and BLIP backbones and on MSCOCO.
6. Insights, System Limitations, and Extension Proposals
Bi-encoder cascades capitalize on the empirical “small-world” effect in retrieval workloads, efficiently focusing expensive computation where it is recurrently needed. Deeper cascades can substantially reduce initial query latency by replacing many strong-encoder passes with cheaper ones, without sacrificing recall. Limitations include the assumption of stationary query distributions and fixed $\alpha$, the necessity of tuning stage cut-offs ($m_j$), increased memory for per-stage caches, and a primary focus on recall rather than other metrics (e.g., end-to-end latency under high QPS, precision@k).
Proposed extensions include adaptive per-query or learned stopping rules (e.g., halting at different cascade depths dynamically), integration with classical ANN indices or cross-encoders (“heterogeneous” cascades), application to video retrieval and multilingual tasks, and joint optimization with quantization or distillation methods (Hönig et al., 2023).
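As one hypothetical instantiation of an adaptive stopping rule, a cascade could halt before deeper stages when the current stage’s score margin already indicates a confident ranking; the margin test and threshold below are illustrative, not from the paper:

```python
import numpy as np

def confident_enough(scores: np.ndarray, margin: float = 0.15) -> bool:
    """Hypothetical early-exit test: stop cascading if the top-1 score
    clearly separates from the runner-up under the current encoder."""
    top2 = np.sort(scores)[-2:]
    return (top2[1] - top2[0]) >= margin

# Usage inside a cascade loop: skip stronger stages when the cheap stage is decisive.
stage_scores = np.array([0.91, 0.55, 0.40, 0.38])
if confident_enough(stage_scores):
    pass  # return the current ranking; no stronger encoder pass needed
```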
7. Comparative Context: Cascaded Encoders beyond Retrieval
A related bi-encoder cascade motif has been applied in end-to-end ASR to unify streaming and non-streaming modes (Narayanan et al., 2020). Here, a causal “streaming” encoder processes input features online, serving latency-sensitive applications; a non-causal “non-streaming” encoder further refines the streaming encoder’s outputs for higher quality at greater compute and latency. Both encoders share a single RNN-T decoder. Empirically, streaming WER matched or improved over baselines, while non-streaming operation improved WER by 10–27% relative across test sets (e.g., VS 5.1%, T-AB 3.3%), showing that cascading also improves two-pass sequence-to-sequence models (Narayanan et al., 2020).
This suggests that bi-encoder cascades are broadly applicable as a resource-efficient design for both multimodal retrieval and sequence-mapping (generation) tasks.
References:
- Hönig et al. (2023). "Bi-Encoder Cascades for Efficient Image Search."
- Narayanan et al. (2020). "Cascaded encoders for unifying streaming and non-streaming ASR."