Cross-Modal Retrieval Overview
- Cross-Modal Retrieval is an information retrieval paradigm that uses a query in one modality to retrieve semantically relevant items from another.
- Techniques like CCA, deep dual-tower neural encoders, and divergence measures align heterogeneous data to enhance retrieval accuracy in applications from multimedia search to medical imaging.
- Ongoing challenges include scalability, robust semantic alignment, and interactive retrieval strategies to adapt to dynamic, multi-modal data environments.
Cross-modal retrieval is a fundamental information retrieval paradigm wherein a query in one modality (such as text, image, audio, video, tactile, or other sensor output) is used to retrieve semantically relevant items from a database belonging to a different modality. This approach addresses the heterogeneous gap between multiple data distributions and representation spaces, and is foundational in applications including multimedia search, knowledge-based question answering, medical data integration, remote sensing, robotics, and retrieval-augmented generation.
1. Foundations and Problem Formulation
The cross-modal retrieval problem consists of two principal requirements: learning shared (or at least coordinated) representations for heterogeneous modalities, and defining/optimizing a compatibility metric such that semantically related instances from different modalities are brought into proximity within that learned space. Canonical correlation analysis (CCA) and its nonlinear and deep extensions have long been used to project features $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ into a common latent space by maximizing statistical correlation, yielding projections $u = W_x^{\top} x$ and $v = W_y^{\top} y$ with maximum cross-modal correlation (Wang et al., 2023). Deep learning has extended this setup with end-to-end neural feature encoders and sophisticated alignment losses (contrastive, metric, adversarial, mutual information, and divergence-based) which cope with increased nonlinearities and semantic abstraction.
The task is generally formalized as: given a query $q$ in modality $\mathcal{A}$ and a database $\mathcal{D} = \{d_1, \dots, d_N\}$ in modality $\mathcal{B}$, retrieve the top-$k$ elements $d_i \in \mathcal{D}$ that maximize the retrieval metric $s\big(f_{\mathcal{A}}(q), f_{\mathcal{B}}(d_i)\big)$ (e.g., cosine similarity in the shared space).
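A minimal sketch of this formulation, assuming the encoders have already mapped the query and database items into a shared embedding space (array shapes and function names are illustrative):

```python
import numpy as np

def topk_cross_modal(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 10):
    """Return indices and scores of the top-k database items for one query.

    query_emb: (d,) embedding of the query in the shared space.
    db_embs:   (N, d) embeddings of the database items (other modality).
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_emb / np.linalg.norm(query_emb)
    D = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    scores = D @ q                      # (N,) similarity of every item to the query
    top = np.argsort(-scores)[:k]       # indices of the k highest-scoring items
    return top, scores[top]
```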
2. Cross-Modal Representation Learning and Alignment
Shallow and Deep Coordination
Early approaches include:
- CCA and its kernelized variants: maximize linear (or kernel-induced) correlation between projections of the two modalities' features (see the CCA sketch after this list).
- Manifold alignment: e.g., Cross-Modal Manifold Learning (Cᴹ²L) jointly preserves local and global geometric data structures across modalities by constructing affinity matrices and aligning graphs via partially corresponding anchors (Conjeti et al., 2016).
- Dictionary and topic models: discover latent semantic correspondences via matrix factorization or shared topics.
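As a toy illustration of the CCA family referenced above, scikit-learn's CCA can project paired features from two modalities into a correlated latent space; the feature matrices here are random placeholders standing in for real image and text features:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_img = rng.normal(size=(500, 128))   # image features for 500 paired samples
X_txt = rng.normal(size=(500, 64))    # text features for the same samples

cca = CCA(n_components=16)            # dimensionality of the shared latent space
cca.fit(X_img, X_txt)
Z_img, Z_txt = cca.transform(X_img, X_txt)  # coordinated projections

# Retrieval then reduces to nearest-neighbour search between Z_img and Z_txt.
```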
Modern frameworks use:
- Dual-tower neural encoders: Each modality passes through a modality-specific encoder and is projected into a joint embedding space, often trained with a pairwise contrastive loss as exemplified in CLIP-style models (Sánchez et al., 29 Jan 2024); a minimal sketch follows this list.
- Fusion strategies: Some methods use a single fused network for all modalities (e.g., image-text fusion via a single network (Nawaz et al., 2018)); others rely on multi-head attention, cross-modal mixers, or encoder-decoder fusion (e.g., in video-audio fusion with a fuse-then-separate autoencoder (Yuan et al., 2023)).
- Divergence-based alignment: Recent work introduces hyperparameter-free divergence measures such as Cauchy-Schwarz (CS) divergence for bi-modal and Generalized CS (GCS) for higher-order cases (unifying three or more modalities) (Zhang et al., 15 Sep 2025).
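A minimal PyTorch sketch of the dual-tower pattern with a symmetric contrastive (CLIP-style) objective; the encoder widths and temperature are illustrative and do not reproduce any specific published model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Modality-specific encoder projecting raw features into the joint space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm joint embeddings

def symmetric_contrastive_loss(za, zb, temperature: float = 0.07):
    """CLIP-style loss: matched pairs sit on the diagonal of the similarity matrix."""
    logits = za @ zb.t() / temperature            # (B, B) cross-modal similarities
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: image_tower = Tower(2048); text_tower = Tower(768)
# loss = symmetric_contrastive_loss(image_tower(img_feats), text_tower(txt_feats))
```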
Key considerations in cross-modal alignment include:
- Preserving local and global structure: Algorithms must maintain neighborhood relations and semantic topology to avoid overfitting local similarities or collapsing diverse content into trivial solutions (Conjeti et al., 2016, Thomas et al., 2020).
- Symmetric and asymmetric loss design: Many models utilize symmetric objectives to ensure both modalities are equally well-aligned (e.g., symmetric cross-entropy over pairwise similarities or GCS divergence over circular triplets of modalities) (Zhang et al., 15 Sep 2025, Sánchez et al., 29 Jan 2024).
- Handling modality gaps: Various methods specifically seek to reduce the "modality gap" via mutual information maximization (Gu et al., 2021), adversarial learning (Huang et al., 2017), or regressing cosine similarities toward principled targets (Sánchez et al., 29 Jan 2024).
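One simple diagnostic for the modality gap (not tied to any of the cited methods) is the distance between the centroids of the two modalities' normalized embeddings; a sketch:

```python
import numpy as np

def modality_gap(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Euclidean distance between the mean unit-normalized embeddings of two modalities."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))
```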
3. Architectures and Methodological Advances
Affinity Construction and Graph Alignment
The Cᴹ²L method (Conjeti et al., 2016) demonstrates how both intra- and inter-modal affinities can be incorporated:
- Intra-modal affinity: Constructed using locally scaled distances and perturbed minimum spanning tree (pMST) averaging to robustly outline the "skeleton" of each modality’s data manifold.
- Inter-modal affinity: Built by linking intra-modal affinities through a set of partially corresponding (anchor) instances, propagating plausible cross-modal similarities across the two graphs; a simplified sketch of this construction follows.
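A simplified sketch of the two-step construction, assuming a locally scaled Gaussian affinity in place of the pMST-based graph used by Cᴹ²L, and integer index arrays for the anchor correspondences:

```python
import numpy as np

def local_scaling_affinity(X: np.ndarray, k: int = 7) -> np.ndarray:
    """Intra-modal affinity with locally scaled distances.

    A simplified stand-in for the pMST-based construction described above.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    sigma = np.sort(d, axis=1)[:, k]                              # distance to the k-th neighbour
    A = np.exp(-d ** 2 / (sigma[:, None] * sigma[None, :] + 1e-12))
    np.fill_diagonal(A, 0.0)
    return A

def inter_modal_affinity(A_x, A_y, anchors_x, anchors_y):
    """Propagate cross-modal similarity through partially corresponding anchor pairs."""
    return A_x[:, anchors_x] @ A_y[anchors_y, :]                  # (N_x, N_y) cross-modal affinities
```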
Transfer and Adversarial Learning
MHTN (Huang et al., 2017) leverages knowledge transfer from large-scale unimodal datasets (e.g., ImageNet) to boost cross-modal embedding. A modal-sharing transfer subnetwork adapts knowledge from source to targets, while a modal-adversarial subnetwork enforces that the joint embedding becomes discriminative of semantic labels but invariant to modality identity. The joint objective, involving MMD for domain adaptation and a gradient reversal strategy for modality adversarial regularization, is critical in settings with limited paired data.
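The gradient reversal strategy mentioned above is typically implemented as a small autograd function that acts as the identity in the forward pass and flips (and scales) gradients in the backward pass; a standard PyTorch sketch, with an illustrative lambda:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage: modality_logits = modality_discriminator(grad_reverse(shared_embedding))
# Training the discriminator then pushes the shared embedding to be modality-invariant.
```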
Implicit Concept Association and Multiple Instance Learning
Implicit cross-modal association (Song et al., 2018) departs from explicit object-level correspondences. Instead, via multiple instance learning, embeddings reflect potentially ambiguous or abstract concept-level links, with learning guided by max-pooling over instance-level similarities to organize high-level semantic alignment even in the presence of weakly associated data.
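A minimal sketch of the max-pooling step in such a multiple-instance formulation: the bag-level cross-modal score is taken as the strongest instance-level similarity (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def bag_similarity(instance_embs_a: torch.Tensor, instance_embs_b: torch.Tensor) -> torch.Tensor:
    """Bag-level cross-modal score via max-pooling over instance-level cosine similarities.

    instance_embs_a: (n_a, d) instances from modality A (e.g., image regions).
    instance_embs_b: (n_b, d) instances from modality B (e.g., sentence fragments).
    """
    a = F.normalize(instance_embs_a, dim=-1)
    b = F.normalize(instance_embs_b, dim=-1)
    sims = a @ b.t()                 # (n_a, n_b) instance-level similarities
    return sims.max()                # the strongest implicit association defines the bag score
```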
Hybrid and Multi-Modal Extensions
Emerging research has moved beyond bi-modal setups:
- Hybrid objectives: Cross-modal RAG performs sub-dimensional query decomposition and Pareto-optimal hybrid retrieval, aligning dense fine-grained similarity with sparse (lexical) subquery satisfaction, enabling multi-aspect matching for complex queries (Zhu et al., 28 May 2025).
- GCS divergence and circular matching: GCS allows tri- and higher-modal alignment using a scalable extension of CS divergence, capturing global semantic consistency more efficiently than exhaustive pairwise alignment (Zhang et al., 15 Sep 2025); a toy divergence estimator is sketched after this list.
- Robotic and tactile applications: VAT-CMR integrates visual, audio, and tactile features into a shared space, introducing dominant modality selection during training to improve discriminative separability (Wojcik et al., 30 Jul 2024).
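For the CS-divergence point above, a kernel-based empirical estimator is sketched below. Note that this simplified form exposes a Gaussian bandwidth, whereas the cited formulation is presented as hyperparameter-free, so it should be read only as an illustration of the quantity being estimated:

```python
import numpy as np

def gaussian_gram(X: np.ndarray, Y: np.ndarray, sigma: float) -> np.ndarray:
    """Gram matrix of a Gaussian kernel between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cs_divergence(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """Empirical Cauchy-Schwarz divergence between two embedding samples.

    D_CS = -2 log E[k(x, y)] + log E[k(x, x')] + log E[k(y, y')].
    """
    kxy = gaussian_gram(X, Y, sigma).mean()
    kxx = gaussian_gram(X, X, sigma).mean()
    kyy = gaussian_gram(Y, Y, sigma).mean()
    return float(-2.0 * np.log(kxy) + np.log(kxx) + np.log(kyy))
```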
4. Evaluation Protocols, Benchmarks, and Applications
Datasets and Metrics
Typical datasets include Wikipedia (image–text), MS-COCO, NUS-WIDE, Pascal Sentences, remote sensing (DSRSID), medical imaging (BraTS), and domain-specific corpora for tri- or higher-modal scenarios (e.g., CUB, KIT-ML, Flickr8k Audio). Key performance metrics:
- Mean Average Precision (mAP)
- Recall@K (hit rate within the top-K retrieved items; see the sketch after this list)
- Median rank, R-Precision
- Specialized domain metrics for medical or multi-label retrieval
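A sketch of how Recall@K and median rank are typically computed under the common single-relevant-item protocol (one ground-truth match per query), with illustrative array shapes:

```python
import numpy as np

def recall_at_k(ranked_ids: np.ndarray, gt_ids: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth item appears in the top-k ranked results.

    ranked_ids: (Q, N) database indices sorted by decreasing similarity per query.
    gt_ids:     (Q,) index of the single relevant item for each query.
    """
    hits = (ranked_ids[:, :k] == gt_ids[:, None]).any(axis=1)
    return float(hits.mean())

def median_rank(ranked_ids: np.ndarray, gt_ids: np.ndarray) -> float:
    """Median (1-indexed) position at which the ground-truth item is retrieved."""
    ranks = (ranked_ids == gt_ids[:, None]).argmax(axis=1) + 1
    return float(np.median(ranks))
```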
Robustness and Comparative Results
State-of-the-art methods have demonstrated:
- Improved cross-modal retrieval accuracy over baselines such as CCA and LCFS for both classification and regression tasks (Conjeti et al., 2016).
- Stable retrieval under limited data and varying heterogeneity, with robust transfer learning and adversarial training showing significant precision and recall gains (Huang et al., 2017).
- Modular improvements: Conversion from KL-based to CS/GCS divergence yields both numerical and semantic alignment gains, especially in unstable or zero-shot settings (Zhang et al., 15 Sep 2025).
- In multi-modal, non-vision-language setups (Audio–Visual–Tactile), attention-based fusion and dominant modality selection strategies provide measurable MAP improvements over classic canonical correlation baselines (Wojcik et al., 30 Jul 2024).
Application Domains
- Medical Imaging: Aggregating complementary information from multiple imaging modalities underpins computer-assisted diagnostics (Conjeti et al., 2016).
- Remote Sensing: PAN–multispectral image retrieval, multi-label image–speech search (Chaudhuri et al., 2019).
- Multimedia retrieval: Video–audio fusion, style/content–aspect disentanglement (Yuan et al., 2023, Zhu et al., 28 May 2025).
- Robotics and tactile perception: Enabling multi-sensor fusion for robust cross-modal retrieval in object-centric robot tasks (Wojcik et al., 30 Jul 2024).
- Zero-shot and knowledge-based inference: Cross-modal retrieval applied as an inference-augmentation technique, e.g., for zero-shot image classification using CLIP feature spaces (Eom et al., 2023).
5. Practical and Computational Considerations
Training Stability and Scalability
The introduction of CS/GCS divergence brings hyperparameter-free, numerically stable alternatives to commonly used divergences (KL, MMD, CORAL), enhancing convergence and reducing the need for heuristic smoothing or kernel selection (Zhang et al., 15 Sep 2025). Both computational and practical gains are evident, particularly as the number of modalities increases—the GCS framework’s linear (rather than quadratic) scaling with modality count is a notable advantage for future multi-modal applications.
Fast, Non-training Approaches
Closed-form linear and orthogonal mappings between pretrained embedding spaces (e.g., Procrustes solution or least-squares) allow for practical cross-modal retrieval without deep joint training or fine-tuning (Choi et al., 2023). These methods can reach up to 77% recall@10 in standard benchmarks when using strong off-the-shelf unimodal transformers, with performance improved via lightweight gMLP refinement and/or modest contrastive learning.
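A sketch of the orthogonal Procrustes mapping, under the assumption that the two pretrained embedding spaces share the same dimensionality (otherwise an ordinary least-squares map can be substituted):

```python
import numpy as np

def procrustes_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Closed-form orthogonal map R minimizing ||src @ R - tgt||_F over paired embeddings.

    src: (n, d) embeddings from the source encoder (e.g., a text transformer).
    tgt: (n, d) embeddings of the paired items from the target encoder (e.g., an image model).
    """
    U, _, Vt = np.linalg.svd(src.T @ tgt)   # SVD of the cross-covariance matrix
    return U @ Vt                            # orthogonal rotation, no training required

# At query time, map a new text embedding with  q_mapped = q_text @ R
# and run nearest-neighbour search against the image embeddings.
```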
Toolbox and Reproducibility
A number of reviewed works make code and unified frameworks public (notably, https://cross-modal-retrieval.github.io and https://github.com/JiahaoZhang666/CSD), facilitating comparative benchmarking and rapid prototyping across a breadth of architectures and loss regimes (Wang et al., 2023, Zhang et al., 15 Sep 2025).
6. Open Challenges and Future Directions
Despite substantial progress, important research challenges remain:
- Scalability to large and dynamic collections: Efficient distributed architectures and federated learning methods are required for real-time, privacy-sensitive, or continuously growing datasets (Wang et al., 2023).
- Robust and adaptive semantic modeling: Improving uncertainty handling, weakly and self-supervised alignment, and robust multi-modal data integration in the face of noisy, incomplete, or unpaired data remains an unresolved problem (Wang et al., 2023).
- Interactive and user-adaptive retrieval: Incorporation of user intent, feedback loops, and dynamic fusion of retrieval evidence is an open area, particularly as multi-modal real-world queries grow in complexity.
- Higher-modality and compositionality: Approaches such as GCS divergence, hybrid Pareto retrieval, and multi-modal enrichment (Zhang et al., 15 Sep 2025, Zhu et al., 28 May 2025, Sánchez et al., 29 Jan 2024) suggest that unified frameworks for composition and multi-frequency matching will become increasingly central as the number and diversity of input and query modalities expand.
- Domain transfer and generalizability: Ongoing work examines transfer across domains, cross-modal domain adaptation, and leveraging external knowledge for knowledge-based QA, zero-shot tasks, or rapid adaptation to novel retrieval targets (Lerner et al., 11 Jan 2024, Eom et al., 2023).
7. Summary Table: Selected Cross-Modal Retrieval Methods
| Approach | Modalities | Principle | Key Features |
|---|---|---|---|
| Cᴹ²L (Conjeti et al., 2016) | Images (medical) | Manifold alignment | Global/local geometry, pMST, anchor correspondences |
| MHTN (Huang et al., 2017) | Images, text, etc. | Adversarial transfer | Single-modal to multi-modal transfer, GRL |
| CMST (Wen et al., 2019) | Images/text | Similarity transfer | Siamese intra-modal similarity, cross-modal loss |
| CS/GCS (Zhang et al., 15 Sep 2025) | n-modal (≥2) | Divergence alignment | CS divergence, GCS extension, circular matching |
| PCMC/PCMR (Sánchez et al., 29 Jan 2024) | n-modal | Unified contrastive/regression | CLIP-style contrastive extension, flexible combination |
| VAT-CMR (Wojcik et al., 30 Jul 2024) | Visual/audio/tactile | Attention fusion | Multi-head fusion, dominant modality selection |
| Cross-modal RAG (Zhu et al., 28 May 2025) | Text/image | Sub-dimensional retrieval | Hybrid sparse+dense, Pareto retrieval, MLLM |
This domain continues to advance rapidly, with unified, stable, and efficient multi-modal alignment representing a central challenge for large-scale retrieval, robust data mining, and next-generation intelligent systems.