MICE: Minimal Interaction Cross-Encoders for efficient Re-ranking

Published 18 Feb 2026 in cs.IR | (2602.16299v1)

Abstract: Cross-encoders deliver state-of-the-art ranking effectiveness in information retrieval, but have a high inference cost. This prevents them from being used as first-stage rankers, but also incurs a cost when re-ranking documents. Prior work has addressed this bottleneck from two largely separate directions: accelerating cross-encoder inference by sparsifying the attention process or improving first-stage retrieval effectiveness using more complex models, e.g. late-interaction ones. In this work, we propose to bridge these two approaches, based on an in-depth understanding of the internal mechanisms of cross-encoders. Starting from cross-encoders, we show that it is possible to derive a new late-interaction-like architecture by carefully removing detrimental or unnecessary interactions. We name this architecture MICE (Minimal Interaction Cross-Encoders). We extensively evaluate MICE across both in-domain (ID) and out-of-domain (OOD) datasets. MICE decreases fourfold the inference latency compared to standard cross-encoders, matching late-interaction models like ColBERT while retaining most of cross-encoder ID effectiveness and demonstrating superior generalization abilities in OOD.


Summary

The paper introduces MICE, a minimal interaction cross-encoder that uses targeted masking of self-attention to isolate and retain only essential query-document interactions.
Empirical results show that MICE achieves nearly full cross-encoder ranking effectiveness with a 4x reduction in inference latency and improved out-of-domain performance.
The architecture enables efficient full-corpus retrieval by precomputing document vectors and applying layer dropping to maintain high ranking quality.

Minimal Interaction Cross-Encoders (MICE) for Efficient Re-ranking: An Expert Analysis

Motivation and Context

Neural Information Retrieval (IR) architectures have reached high effectiveness with interaction-heavy transformer-based cross-encoders, yet their inference cost remains prohibitive for full-corpus first-stage retrieval. This cost has entrenched the two-stage retrieve-and-rerank paradigm, in which cross-encoders only rerank candidate pools pre-selected by faster, less effective retrieval models (e.g., bi-encoders, BM25), and even that reranking step is expensive. Late-interaction models such as ColBERT offer improved efficiency, but their ranking performance lags behind cross-encoders. Prior acceleration methods targeting self-attention sparsity have not fully closed this gap.

This paper introduces MICE (Minimal Interaction Cross-Encoders), an approach that uses interpretability and ablation studies to identify and eliminate non-essential self-attention interactions within cross-encoders. The resulting architecture retains most of the cross-encoder's effectiveness while matching late-interaction efficiency.

Figure 1: MICE Architecture: stripping cross-encoders to keep the strict minimum interactions that maintain effectiveness.

Methodological Overview: Masking and Architectural Derivation

The authors systematically decompose the cross-encoder self-attention mechanism by masking specific interactions among input segments ([CLS], Query ( $Q$ ), Document ( $D$ ), [SEP] tokens). Following interpretability studies, several cumulative masking steps are proposed:

Mask Step 0: Block information flow to [SEP] tokens and from [CLS] to other parts, retaining attention sinks.
Mask Step 1: Prevent document-to-[CLS] transfers.
Mask Step 2: Remove query-to-document ( $D \leftarrow Q$ ) flows, hypothesized to be less critical.
Mask Step 3: In early layers, block bidirectional query-document interactions, enabling independent query/document contextualization (mid-fusion).

These steps isolate the minimum set of attention pathways required for effective ranking. Ablations show that the masks do not work when applied post-hoc to an already fine-tuned model, because the model has learned to rely on the masked pathways. Fine-tuning with the masks in place, however, maintains or even improves effectiveness, particularly out-of-domain (OOD).
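The cumulative masking steps can be sketched as a boolean attention mask over segment labels. This is a minimal illustration, not the authors' code: the names `blocked_pairs` and `build_mask`, the `early_layer` flag, the toy token layout, and the exact reading of each step (for instance, that [SEP] may still attend to [SEP]) are assumptions made for the example. `mask[i][j]` is `True` when token `i` (the target) may attend to token `j` (the source).

```python
SEGMENTS = ("CLS", "Q", "SEP", "D")

def blocked_pairs(step, early_layer=False):
    """Return the (target, source) segment pairs blocked at a cumulative step.

    A negative step means no masking. Each step adds to the previous ones.
    """
    blocked = set()
    if step >= 0:
        # Step 0: nothing flows into [SEP]; [CLS] sends nothing to other parts.
        # Other tokens may still attend to [SEP], so it can act as a sink.
        blocked |= {("SEP", src) for src in SEGMENTS if src != "SEP"}
        blocked |= {(tgt, "CLS") for tgt in SEGMENTS if tgt != "CLS"}
    if step >= 1:
        # Step 1: [CLS] no longer reads directly from the document.
        blocked.add(("CLS", "D"))
    if step >= 2:
        # Step 2: the document no longer reads from the query (D <- Q).
        blocked.add(("D", "Q"))
    if step >= 3 and early_layer:
        # Step 3: in early layers, the query does not read from the document either.
        blocked.add(("Q", "D"))
    return blocked

def build_mask(token_segments, step, early_layer=False):
    """Build a square mask: mask[i][j] is True if token i may attend to token j."""
    blocked = blocked_pairs(step, early_layer)
    return [[(tgt, src) not in blocked for src in token_segments]
            for tgt in token_segments]

# Toy input layout (assumed): [CLS] q1 q2 [SEP] d1 d2 [SEP]
layout = ["CLS", "Q", "Q", "SEP", "D", "D", "SEP"]
late_mask = build_mask(layout, step=3)
early_mask = build_mask(layout, step=3, early_layer=True)
```

In this sketch, the only difference between `early_mask` and `late_mask` is whether query tokens may read from document tokens, which is what makes the early layers independent for the two segments.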

Figure 2: Masking approach. Interactions between input parts ([CLS], Q, D, [SEP]) are blocked using cumulative masking.

MICE: Architecture and Implementation

MICE is constructed with three architectural innovations:

Mid-Fusion: Initial layers encode query and document independently, deferring interaction until later layers.
Light Cross-Attention: During interaction layers, only information transfer from a (frozen) document representation to the query is permitted; document tokens receive no further contextualization or updates.
Layer Dropping: Final layers specialized for masked language modeling are pruned; only a minimal number of interaction layers (e.g., three) are retained, empirically shown to recover near-original performance.
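The control flow implied by these three components can be illustrated with a toy, weight-free model. Everything here is an assumption made for illustration: the attention uses no learned projections, residual connections, or feed-forward blocks, the helper names (`attention`, `encode_independently`, `mice_like_score`) are hypothetical, the readout from the first query position is arbitrary, and the layer counts are placeholders (three interaction layers simply echoes the example in the text). It is not the paper's implementation; it only shows that document states can be computed once and then read, never written, at query time.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(targets, sources):
    """Each target vector becomes a softmax-weighted average of the sources."""
    dim = len(sources[0])
    updated = []
    for tgt in targets:
        logits = [sum(a * b for a, b in zip(tgt, src)) / math.sqrt(dim)
                  for src in sources]
        weights = softmax(logits)
        updated.append([sum(w * src[k] for w, src in zip(weights, sources))
                        for k in range(dim)])
    return updated

def encode_independently(tokens, n_layers):
    """Mid-fusion stage: self-attention restricted to a single segment."""
    for _ in range(n_layers):
        tokens = attention(tokens, tokens)
    return tokens

def mice_like_score(query_tokens, doc_states, n_independent=2, n_interaction=3):
    """Score a query against frozen, pre-encoded document states."""
    q = encode_independently(query_tokens, n_independent)
    for _ in range(n_interaction):
        # Light cross-attention: query tokens read from query + document
        # states; doc_states is never updated.
        q = attention(q, q + doc_states)
    return sum(q[0])  # arbitrary toy readout from the first query position

# Offline: encode the document once and cache the result.
document = [[0.2, 0.1, 0.0], [0.9, 0.4, 0.3]]
cached_doc_states = encode_independently(document, n_layers=2)

# Online: only the query side is computed per request.
query = [[1.0, 0.0, 0.5], [0.3, 0.7, 0.1]]
score = mice_like_score(query, cached_doc_states)
```

Because the interaction layers never write to the document side, the cached states give the same score as re-encoding the document at query time, which is the property that allows offline document encoding.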

Empirical evaluation demonstrates that MICE, when properly configured (e.g., MICE- $\ell$ 4+3 with MiniLM-L12-v2 backbone), can drop multiple late backbone layers without loss in ranking effectiveness.

Figure 3: MiniLM-L12-v2 backbone used for MICE, facilitating efficient layer dropping and minimal interaction.

Figure 4: Impact of dropping the backbone's late layers in MICE. Three interaction layers consistently recover full performance.

Empirical Results

MICE was evaluated extensively on standard re-ranking benchmarks:

In-Domain (ID): MS MARCO, TREC Deep Learning tracks.
Out-of-Domain (OOD): 13 BEIR benchmark datasets.

Key findings:

MICE achieves a $4\times$ reduction in inference latency compared to cross-encoders, closely matching ColBERT (late-interaction) efficiency.
In ID, MICE nearly recovers full cross-encoder effectiveness (e.g., an nDCG@10 loss of less than $1$ point).
OOD generalization improves in MICE versus baselines, indicating regularization from interaction minimization.
On "BM25-hard" datasets—where traditional cross-encoders underperform—MICE outperforms its own baseline and ColBERT.
Scaling laws show MICE effectiveness grows with backbone size identically to standard cross-encoders; performance costs for minimal interaction are marginal.

Figure 5: Scaling law of MICE against standard cross-encoder using backbones from the Ettin suite.
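For readers less familiar with the metric quoted in the findings above, nDCG@10 can be computed as in the following sketch. It uses one common formulation (linear gain, log2 rank discount); which exact variant and evaluation tooling the paper uses is not stated in this summary, so treat that choice as an assumption. As a simplification, the ideal ranking is derived from the same list, which presumes the list contains every judged document for the query.

```python
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the documents in the order a ranker returned them.
perfect = ndcg([3, 2, 1, 0])
swapped = ndcg([0, 1])
```

Scores lie in [0, 1]; results are often reported multiplied by 100, in which case a difference of one unit corresponds to 0.01 on this scale.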

Efficiency Analysis

Inference profiling on MiniLM-L12-v2 backbone revealed:

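Standard cross-encoder: $470\,\mathrm{ms}$ latency per query, $267$ docs/s, $1193\,\mathrm{MB}$ peak memory.
MICE (with offline document encoding): $113\,\mathrm{ms}$ latency per query, $1130$ docs/s, $598\,\mathrm{MB}$ peak memory.
ColBERT: $130\,\mathrm{ms}$ latency per query, $982$ docs/s, $331\,\mathrm{MB}$ peak memory.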

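MICE is roughly $4\times$ faster than the cross-encoder and marginally faster than ColBERT, and it roughly halves the cross-encoder's peak memory. Its memory footprint is, however, noticeably higher than ColBERT's ($598$ vs. $331$ MB), which is attributed to the frozen document representations. Without precomputed document encodings, the speedup drops to $2\times$.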
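The ratios behind these statements follow directly from the profiling figures listed above. The short calculation below only restates those reported numbers; the dictionary layout and variable names are illustrative.

```python
# Reported profiling figures on the MiniLM-L12-v2 backbone.
profiles = {
    "cross-encoder": {"latency_ms": 470, "docs_per_s": 267, "peak_mb": 1193},
    "MICE": {"latency_ms": 113, "docs_per_s": 1130, "peak_mb": 598},
    "ColBERT": {"latency_ms": 130, "docs_per_s": 982, "peak_mb": 331},
}

def ratio(numerator, denominator, key):
    """Ratio of one system's figure to another's for the given measurement."""
    return profiles[numerator][key] / profiles[denominator][key]

speedup_vs_ce = ratio("cross-encoder", "MICE", "latency_ms")    # about 4.16
speedup_vs_colbert = ratio("ColBERT", "MICE", "latency_ms")     # about 1.15
throughput_vs_ce = ratio("MICE", "cross-encoder", "docs_per_s") # about 4.23
memory_vs_ce = ratio("MICE", "cross-encoder", "peak_mb")        # about 0.50
memory_vs_colbert = ratio("MICE", "ColBERT", "peak_mb")         # about 1.81
```

So the latency advantage over ColBERT is about 15%, while MICE's peak memory is about 1.8 times ColBERT's and about half the cross-encoder's.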

Contradictory Claims and Strong Results

The study presents evidence that contradicts prior assumptions: masking query-to-document interactions ( $D \leftarrow Q$ ) is not detrimental and in fact enhances OOD generalization, whereas masking document-to-query interactions ( $Q \leftarrow D$ ), as done in Sparse CE, does not yield comparable effectiveness. The strongest numerical claim is that, on BM25-hard datasets, MICE gains up to $11.3$ nDCG@10 points over its own baseline.

Practical and Theoretical Implications

Practically, MICE facilitates deployment of cross-encoder-level ranking models as first-stage retrievers, enabling full-corpus search with acceptable latency and resource usage. The minimal interaction design allows for document vector pre-computation, akin to late-interaction models, integrating with existing IR infrastructure.

Theoretically, the findings support a refined understanding of transformer-based cross-encoders in IR, revealing which interaction directions are crucial and which are redundant, aligning with interpretability studies. Masking unnecessary attention pathways acts as a regularizer, improving robustness and generalization.

Speculation on Future Development in AI

MICE's reductionist approach can guide the future design of efficient, accurate neural ranking architectures, especially in large-scale, heterogeneous retrieval contexts. Extending MICE to larger backbones, integrating more aggressive dimensionality compression with distillation, and automating the choice of layers or masks based on corpus-specific data may further improve efficiency. These developments may allow cross-encoder models to become standard in real-time, full-corpus retrieval, overcoming the entrenched trade-off between speed and ranking effectiveness in neural IR.

Conclusion

By combining targeted masking of self-attention interactions, mid-fusion contextualization, light cross-attention, and layer dropping, Minimal Interaction Cross-Encoders (MICE) achieve a new effectiveness-efficiency operating point for neural IR. MICE matches late-interaction efficiency while almost fully preserving cross-encoder effectiveness, and demonstrates superior out-of-domain generalization. The approach redefines the architectural requirements for high-performing rankers and establishes a foundation for scalable, robust, and precise neural retrieval systems.

Reference: "MICE: Minimal Interaction Cross-Encoders for efficient Re-ranking" (2602.16299)
