- The paper introduces ReT-2, a recurrent Transformer that dynamically fuses textual and visual features via layer-specific gating mechanisms.
- It employs layer pruning and token reduction to improve efficiency while maintaining robust semantic alignment and retrieval accuracy.
- Empirical evaluations show state-of-the-art performance on benchmarks like M-BEIR, demonstrating scalability for universal multimodal retrieval tasks.
Introduction and Motivation
The paper introduces ReT-2, a unified multimodal retrieval model designed to support queries and documents containing both text and images. The motivation stems from the limitations of prior vision-language retrieval systems, which typically restrict themselves to single-modality queries or documents and rely on final-layer features for representation. ReT-2 addresses these constraints by leveraging multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms, enabling dynamic integration of information across layers and modalities. This design facilitates fine-grained semantic alignment and robust retrieval in highly compositional multimodal scenarios.
Figure 1: ReT-2 achieves superior average performance across M-BEIR tasks compared to previous methods, supporting diverse multimodal retrieval configurations.
Architectural Innovations
Recurrent Cell Design
The core of ReT-2 is a Transformer-based recurrent cell that fuses layer-specific visual and textual features. At each layer, the cell merges its hidden state with features from both modalities using cross-attention, followed by gated linear combinations modulated by learnable forget and input gates. This mechanism allows selective retention of information from shallower layers and dynamic modulation of unimodal feature flow, enhancing the model's ability to capture both low-level and high-level details.
Figure 2: The recurrent cell integrates layer-specific textual and visual features into a matrix-form hidden state, enabling multi-layer fusion.
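A minimal PyTorch sketch of such a gated fusion cell is shown below; the module name, gate parameterization, and tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a ReT-2-style recurrent fusion cell (names and shapes assumed).
import torch
import torch.nn as nn

class RecurrentFusionCell(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # LSTM-inspired gates: a forget gate for the carried state and one
        # input gate per modality to modulate the incoming features.
        self.forget_gate = nn.Linear(3 * d_model, d_model)
        self.input_gate_txt = nn.Linear(3 * d_model, d_model)
        self.input_gate_img = nn.Linear(3 * d_model, d_model)

    def forward(self, state, txt_feats, img_feats):
        # state: (B, 1, D) carried hidden state; txt_feats/img_feats: (B, N, D)
        # layer-specific features from the textual and visual backbones.
        txt_ctx, _ = self.txt_attn(state, txt_feats, txt_feats)   # cross-attend to text
        img_ctx, _ = self.img_attn(state, img_feats, img_feats)   # cross-attend to vision
        gate_in = torch.cat([state, txt_ctx, img_ctx], dim=-1)
        f = torch.sigmoid(self.forget_gate(gate_in))       # how much old state to keep
        i_t = torch.sigmoid(self.input_gate_txt(gate_in))  # how much text to admit
        i_v = torch.sigmoid(self.input_gate_img(gate_in))  # how much vision to admit
        return f * state + i_t * txt_ctx + i_v * img_ctx   # gated linear combination
```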
Layer Pruning and Token Reduction
To improve efficiency and robustness, ReT-2 prunes the number of layers processed by the recurrent cell, selecting three representative layers (early, middle, and late) from each backbone. Additionally, the model reduces the number of input tokens from 32 to a single token per modality, addressing rank collapse observed in previous architectures and simplifying the contrastive objective.
Figure 3: ReT-2 introduces token reduction, layer pruning, and global feature injection, differentiating it from the original ReT.
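The layer pruning and single-token recurrence can be sketched as follows, reusing the `RecurrentFusionCell` above; the layer indices, helper names, and shapes are assumptions for illustration.

```python
# Illustrative layer pruning and single-token recurrence (indices and shapes assumed).
import torch

def select_layers(hidden_states):
    # Keep three representative layers: early, middle, late.
    n = len(hidden_states)
    return [hidden_states[0], hidden_states[n // 2], hidden_states[-1]]

def encode(cell, txt_layers, img_layers, init_state):
    # Run the recurrent cell over the pruned layers, carrying a single fused
    # token of shape (B, 1, D) as the hidden state instead of 32 tokens.
    state = init_state
    for txt_h, img_h in zip(select_layers(txt_layers), select_layers(img_layers)):
        state = cell(state, txt_h, img_h)
    return state.squeeze(1)  # one embedding per query or document
```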
Global Feature Injection
ReT-2 augments the output of the recurrent cell with global features from the pooler tokens of the visual and textual backbones. This integration provides broader contextual information, further improving retrieval accuracy and robustness.
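A rough sketch of how such global features could be injected is given below; the concatenate-and-project combination is an assumption, and the paper may combine pooled and fused features differently.

```python
# Sketch of global feature injection (combination method assumed).
import torch
import torch.nn as nn

class GlobalInjection(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, fused, txt_pooled, img_pooled):
        # fused: output of the recurrent cell; txt_pooled / img_pooled: pooler
        # tokens of the textual and visual backbones. All shaped (B, D).
        return self.proj(torch.cat([fused, txt_pooled, img_pooled], dim=-1))
```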
Training and Implementation Details
ReT-2 employs shared weights between query and document encoders, reducing model complexity and promoting consistent representation learning. Training utilizes the InfoNCE loss over the single fused token from both query and document sides. The model is compatible with various backbone architectures (CLIP, SigLIP2, OpenCLIP, ColBERTv2), and layer selection is standardized for architectural compatibility. Mixed precision training and gradient checkpointing are used for efficiency, with Faiss employed for fast retrieval during inference.
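The contrastive objective over single fused tokens can be illustrated with a standard in-batch InfoNCE loss; the symmetric formulation and temperature value below are assumptions, not details from the paper.

```python
# In-batch InfoNCE over L2-normalized single-token embeddings (temperature assumed).
import torch
import torch.nn.functional as F

def info_nce(query_emb, doc_emb, temperature: float = 0.05):
    q = F.normalize(query_emb, dim=-1)          # (B, D) query embeddings
    d = F.normalize(doc_emb, dim=-1)            # (B, D) document embeddings
    logits = q @ d.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # Matching query-document pairs lie on the diagonal; the other documents
    # in the batch serve as negatives. The loss is symmetrized over both views.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```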
Empirical Evaluation
Multimodal Retrieval Benchmarks
ReT-2 is evaluated on the M2KR and M-BEIR benchmarks, encompassing a wide range of multimodal retrieval tasks and domains. Across all configurations, ReT-2 consistently outperforms prior methods, including FLMR, PreFLMR, UniIR, GENIUS, and MLLM-based retrievers. Notably, ReT-2 achieves state-of-the-art recall metrics, with substantial gains observed when backbones are unfrozen and scaled.
Ablation Studies
Ablation analyses demonstrate the effectiveness of each architectural modification. Token reduction and layer pruning yield efficiency gains without sacrificing performance, while global feature injection provides a +2.6 point improvement over the original ReT. Shared encoder weights further reduce overfitting, particularly on entity-centric datasets.
Figure 4: Gate activation analysis reveals the importance of selected layers for multimodal fusion, supporting the layer pruning strategy.
Computational Efficiency
ReT-2 offers significant improvements in inference speed and memory usage compared to fine-grained late-interaction models and MLLM-based retrievers. The use of a single token and pruned layers enables faster forward passes and retrieval, making ReT-2 suitable for large-scale deployment.
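Because each query and document reduces to a single embedding, retrieval can be served with a simple Faiss index; the snippet below is a generic example with assumed dimensionality and random placeholder embeddings, not the paper's exact setup.

```python
# Generic Faiss example for single-vector retrieval (dimensionality and data assumed).
import faiss
import numpy as np

d = 768                                  # embedding dimensionality (assumed)
doc_embs = np.random.rand(10_000, d).astype("float32")  # placeholder document index
faiss.normalize_L2(doc_embs)             # cosine similarity via inner product

index = faiss.IndexFlatIP(d)
index.add(doc_embs)

query = np.random.rand(1, d).astype("float32")          # placeholder query embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, k=5)   # top-5 documents for the query
```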
Retrieval-Augmented Generation for VQA
ReT-2 is integrated into retrieval-augmented generation pipelines for knowledge-intensive visual question answering (VQA) tasks, such as Encyclopedic-VQA and InfoSeek. When paired with off-the-shelf MLLMs (LLaVA-MORE, Qwen2.5-VL), ReT-2 enables higher answer accuracy without task-specific fine-tuning, outperforming both general-purpose and task-specific retrievers.
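A hypothetical glue function for this pipeline is sketched below; `retriever` and `generate` stand in for a ReT-2 retrieval call and an off-the-shelf MLLM interface, and are not APIs from the paper or any specific library.

```python
# Hypothetical retrieval-augmented VQA glue code (interfaces assumed).
def answer_with_retrieval(question: str, image, retriever, generate, k: int = 3) -> str:
    # Retrieve top-k multimodal documents for the (image, question) query.
    docs = retriever(question, image, top_k=k)            # list of (title, passage)
    context = "\n".join(f"{title}: {passage}" for title, passage in docs)
    prompt = (
        "Use the retrieved context to answer the question about the image.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(image=image, prompt=prompt)           # off-the-shelf MLLM call
```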
Figure 5: Sample results for InfoSeek VQA, showing improved answer accuracy when Qwen2.5-VL is augmented with ReT-2-retrieved context.
Figure 6: Sample results for Encyclopedic-VQA, demonstrating ReT-2's ability to retrieve relevant multimodal documents for complex questions.
Theoretical and Practical Implications
ReT-2 demonstrates that multi-layer feature integration via recurrence and gating mechanisms is highly effective for universal multimodal retrieval. The model's architectural simplicity, efficiency, and robustness position it as a practical backbone for retrieval-augmented generation and other downstream multimodal tasks. The findings suggest that reliance on large MLLMs for retrieval can be mitigated by principled architectural design, enabling scalable and generalizable solutions.
Future Directions
Potential future developments include extending ReT-2 to additional modalities (audio, video), exploring adaptive layer selection strategies, and integrating more advanced gating mechanisms. The approach may also be adapted for real-time retrieval in interactive systems and further optimized for low-resource deployment.
Conclusion
ReT-2 establishes a new standard for universal multimodal retrieval by combining multi-layer feature fusion, recurrence, and gating within a unified Transformer framework. The model achieves strong empirical results across diverse benchmarks, offers significant efficiency gains, and enhances downstream performance in retrieval-augmented generation tasks. These contributions underscore the value of recurrent integration for scalable and robust multimodal retrieval systems.