Mamba Retriever: SSM-Based Scalable Retrieval
- Mamba Retriever is a family of retrieval mechanisms using SSM blocks to replace transformer self-attention with linear-time recurrence, enabling scalable long-context processing.
- It integrates bi-encoder setups for dense text, register banks for visual-temporal tracking, and dual-branch multimodal fusion for video retrieval to ensure robust performance.
- Empirical results reveal that Mamba Retriever outperforms transformer baselines in metrics like MRR@10, nDCG@10, and inference speed, highlighting its practical efficiency.
Mamba Retriever refers to a family of retrieval and temporal integration mechanisms grounded in the Mamba Selective State-Space Model (SSM) architecture, which replaces transformer self-attention with linear-time recurrence, enabling efficient and effective retrieval across dense text, video, and temporal visual tasks. Distinct "Mamba Retriever" implementations have been proposed for dense text passage retrieval, temporally robust medical vision tracking, and long-context multimodal video retrieval, all leveraging the linear efficiency and long-range capacity of Mamba-based SSM blocks.
1. Foundations of the Mamba SSM Architecture
Unlike canonical transformers, which use multi-head self-attention with $O(L^2)$ complexity for sequences of length $L$, the Mamba architecture employs stacks of Selective State Space Model (SSM) blocks. Each SSM block maintains a recurrent state $h_t$ and updates it per token as $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$, $y_t = C_t h_t$, where $\bar{A}_t$, $\bar{B}_t$, and $C_t$ are adaptively parameterized per input token. This yields $O(L)$ processing per layer. In practical hardware implementations, these updates are efficiently parallelized with associative-scan kernels. No positional embeddings are strictly required, though optional small learned offsets or rotary embeddings can be included. The recurrent state mechanism enables effective modeling of long-range dependencies beyond the practical limits of transformer attention (Zhang et al., 2024).
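The per-token recurrence above can be sketched as a naive, unfused scan; shapes and names are illustrative, and real Mamba kernels fuse and parallelize this loop:

```python
import numpy as np

def selective_ssm_scan(x, A, B, C):
    """Minimal selective-SSM recurrence (illustrative, not the fused
    hardware kernel). Per-token parameters B_t and C_t make the state
    update input-dependent; total cost is O(L) in sequence length L.

    x: (L, d)  input token features
    A: (n,)    diagonal state transition (shared across tokens here)
    B: (L, n)  input projection, parameterized per token
    C: (L, n)  output projection, parameterized per token
    """
    L, d = x.shape
    n = A.shape[0]
    h = np.zeros((n, d))  # recurrent state
    y = np.zeros_like(x)
    for t in range(L):
        # h_t = diag(A) h_{t-1} + B_t x_t  (outer product over feature dim)
        h = A[:, None] * h + B[t][:, None] * x[t][None, :]
        # y_t = C_t h_t
        y[t] = C[t] @ h
    return y
```

Because the update is a linear recurrence, the same computation admits a parallel associative scan, which is what makes long sequences tractable on GPUs.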
2. Mamba Retriever for Dense Textual Retrieval
The primary "Mamba Retriever" for dense information retrieval employs a bi-encoder setup with two identical Mamba encoders, one for queries $q$ and one for passages $p$. The encoded representation is obtained by appending a special end-of-sequence token and extracting its final hidden state:

$$E(x) = \mathrm{Mamba}(x \oplus \langle\mathrm{eos}\rangle)_{-1}.$$

Similarity between query and passage embeddings is measured using cosine similarity:

$$s(q, p) = \frac{E(q) \cdot E(p)}{\lVert E(q) \rVert \, \lVert E(p) \rVert}.$$

The model is fine-tuned with an InfoNCE contrastive loss to maximize similarity for positive pairs and minimize it for negatives:

$$\mathcal{L} = -\log \frac{\exp\big(s(q, p^{+})/\tau\big)}{\exp\big(s(q, p^{+})/\tau\big) + \sum_{p^{-}} \exp\big(s(q, p^{-})/\tau\big)}.$$

Empirical evaluation on the MS MARCO and BEIR benchmarks establishes that Mamba Retriever matches or surpasses transformer-based baselines in both accuracy (MRR@10, nDCG@10) and recall, with inference times that scale linearly in sequence length. For instance, Mamba-130M achieves MRR@10 = 32.3 and nDCG@10 = 40.54 on MS MARCO and BEIR, respectively, outperforming BERT-base and RoBERTa-base. For very long contexts (e.g., 8K tokens in LoCoV0), Mamba-130M reaches 90.7 nDCG@10 while transformer models become prohibitively slow or nonviable. At 8K tokens, transformer inference latency is a large multiple of Mamba's, making Mamba preferable for large-scale, long-text first-stage retrieval (Zhang et al., 2024).
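The scoring and contrastive objective above can be sketched in a few lines of NumPy, assuming in-batch negatives (passage $i$ is the positive for query $i$); the temperature value is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a and rows of b -> (len(a), len(b))."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def infonce_loss(q_emb, p_emb, tau=0.05):
    """InfoNCE with in-batch negatives: passage i is the positive for
    query i, and the other passages in the batch serve as negatives.
    tau (temperature) is an illustrative value."""
    s = cosine_sim(q_emb, p_emb) / tau           # (B, B) similarity matrix
    s = s - s.max(axis=1, keepdims=True)         # numerical stability
    log_probs = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # -log p(positive | query)
```

When query and positive-passage embeddings align, the diagonal of the similarity matrix dominates and the loss approaches zero; misaligned positives drive it up.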
3. Mamba Retriever for Robust Visual-Temporal Tracking
In visual domains, the Mamba Retriever design has been adapted to track rapidly moving targets with occlusion or signal degradation. The MrTrack system, developed for ultrasound-guided fine needle aspiration, introduces a Mamba-based register extractor and retriever for real-time temporal context integration (Zhang et al., 14 May 2025).
The system employs:
- A ViT-Base backbone yielding spatial features.
- A register extractor that interleaves a small set of learnable register tokens into the patch sequence of the current search features, processes the combined sequence with a Mamba SSM, and outputs both denoised search features and a distilled temporal descriptor.
- A FIFO register bank storing up to a fixed number of the most recent descriptors.
- A register retriever that fuses the historical descriptors into the template feature (from the initial frame) via cross-map scanning with another Mamba SSM, producing a dynamic template.
- A prediction head applying cross-attention between the dynamic template and the search features for final tip localization.
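The FIFO register bank in this pipeline can be sketched as a small fixed-capacity queue; the capacity and descriptor layout here are illustrative assumptions, not MrTrack's exact values:

```python
from collections import deque
import numpy as np

class RegisterBank:
    """Hedged sketch of a FIFO register bank: stores the most recent
    temporal descriptors and silently evicts the oldest once capacity
    is reached."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry on overflow
        self.bank = deque(maxlen=capacity)

    def push(self, descriptor):
        self.bank.append(np.asarray(descriptor))

    def retrieve(self):
        """Return stacked historical descriptors (oldest first) for
        fusion into the template feature, or None if the bank is empty."""
        return np.stack(list(self.bank)) if self.bank else None
```

Evicting old descriptors bounds memory and keeps the retrieved context focused on recent frames, which matters when the target moves rapidly.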
A self-supervised register diversify loss regularizes the register bank toward decorrelation across both the temporal and feature dimensions, combining a variance-promoting term with a cross-register diversify term; the detailed forms penalize both intra-dimension variance collapse and inter-register covariance.
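A VICReg-style form of such a regularizer can be sketched as follows, assuming a hinge on per-dimension standard deviation plus an off-diagonal covariance penalty; the exact terms and weights in MrTrack may differ:

```python
import numpy as np

def register_diversify_loss(regs, gamma=1.0, eps=1e-4):
    """Hedged sketch of a diversify regularizer over a bank of register
    descriptors `regs` of shape (N, d). The variance term keeps each
    feature dimension from collapsing; the covariance term decorrelates
    dimensions across registers. gamma and eps are illustrative."""
    regs = regs - regs.mean(axis=0, keepdims=True)
    std = np.sqrt(regs.var(axis=0) + eps)
    var_term = np.mean(np.maximum(0.0, gamma - std))   # promote variance
    cov = (regs.T @ regs) / max(regs.shape[0] - 1, 1)  # (d, d) covariance
    off = cov - np.diag(np.diag(cov))
    cov_term = (off ** 2).sum() / regs.shape[1]        # penalize correlation
    return var_term + cov_term
```

A fully collapsed bank (identical descriptors) incurs the maximal variance penalty, while well-spread, decorrelated descriptors score low.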
Ablations demonstrate that Mamba-based register retrieval is critical for both robustness (ΔAUC = –6.4% if replaced with transformer self-attention) and speed (–12.6 FPS for transformer baseline), confirming the architectural advantage (Zhang et al., 14 May 2025).
4. Mamba Retriever Extensions for Multimodal and Temporal Retrieval
The MamFusion architecture generalizes the Mamba Retriever for partially relevant video retrieval, handling long untrimmed video and text queries by leveraging a multi-Mamba dual-branch framework with temporal fusion (Ying et al., 4 Jun 2025). In MamFusion:
- Text and video streams are independently encoded, with video processed at both frame and clip levels through stacks of GMMFormer attention and Mamba SSM blocks.
- Mamba blocks in the video branches use a fixed state dimension and local convolution width, jointly capturing local and global temporal structure.
- Bidirectional temporal fusion modules conduct (i) Temporal Video-to-Text (TVT) attention, fusing Mamba video outputs into text features, and (ii) Temporal Text-to-Video (TTV) attention, fusing pooled textual representations into video features.
- Retrieval scores after fusion are based on the cosine similarity between bidirectionally fused text and video representations.
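The bidirectional fusion and scoring described above can be sketched with plain single-head cross-attention; projection matrices and multi-head structure are omitted, so this is an assumption-laden sketch rather than MamFusion's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: rows of `queries` attend over rows of
    `keys_values`. Learned projections are omitted for brevity (an
    assumption, not the paper's formulation)."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return attn @ keys_values

def fused_retrieval_score(text_feats, video_feats):
    """TVT fuses video context into text features; TTV fuses the fused
    text back into video features. The retrieval score is the cosine
    similarity of the mean-pooled fused streams."""
    text_fused = text_feats + cross_attend(text_feats, video_feats)    # TVT
    video_fused = video_feats + cross_attend(video_feats, text_fused)  # TTV
    t = text_fused.mean(axis=0)
    v = video_fused.mean(axis=0)
    return float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v) + 1e-8))
```

Because the score is a cosine similarity of pooled representations, it stays in $[-1, 1]$ and can be ranked directly across candidate videos.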
MamFusion achieves state-of-the-art recall on ActivityNet, Charades-STA, and TVR, with substantial gains attributed to its dual-stream multi-Mamba design and temporal fusion. Ablation shows that removing the multi-Mamba component or either fusion block substantially degrades retrieval SumR (Ying et al., 4 Jun 2025).
5. Comparative Analysis and Empirical Outcomes
Empirical results across diverse domains consistently show Mamba Retriever architectures achieving strong accuracy while scaling efficiently with sequence length:
- For text retrieval, Mamba models match or outperform transformers at all sizes on MS MARCO and BEIR, and reach superior performance on long-text LoCoV0 (e.g., Mamba-130M: 2K tokens nDCG@10 = 89.1; 8K = 90.7).
- In visual tracking, MrTrack outperforms state-of-the-art trackers in both mean AUC (motorized: 65.4% vs 60.8%) and inference speed (73.9 FPS).
- In video retrieval, MamFusion leads benchmarks with the highest SumR across datasets, with ablation verifying the necessity of Mamba-based temporal modeling for partial relevance.
This performance is largely rooted in the linear time complexity of the SSM-based Mamba blocks, their ability to absorb much longer temporal context post-fine-tuning, and architectural adaptations (e.g., register banks, fusion blocks) designed for specific retrieval modalities (Zhang et al., 2024, Zhang et al., 14 May 2025, Ying et al., 4 Jun 2025).
6. Architectural Implications and Significance
Mamba Retriever approaches demonstrate the viability of state-space modeling as a robust alternative to self-attention, particularly in contexts demanding long-context understanding, real-time inference, and temporal robustness. Selective SSM updates allow for much larger input windows with only minor slowdowns, and bidirectional fusion or register-based prompting can preserve and inject context across modalities and frames.
A plausible implication is that future retrieval models across domains (text, vision, video, and multimodal) will increasingly adopt Mamba-derived architectures, especially as sequence lengths and deployment efficiency requirements continue to grow. The diverse empirical successes across benchmarks suggest that SSM-based retrieval may form a new backbone family for scalable, context-rich retrieval systems.