A Survey on Cross-modal Retrieval Techniques
Cross-modal retrieval has emerged as a critical area of study due to the proliferation of multimodal data in contemporary digital ecosystems. This paper offers a comprehensive survey of cross-modal retrieval strategies, which are essential for enabling effective information retrieval across diverse data types such as images, text, and videos.
Overview of Cross-modal Retrieval Methods
The paper categorizes cross-modal retrieval methods into two primary groups: real-valued representation learning and binary representation learning, also known as cross-modal hashing. Real-valued representation learning focuses on deriving common continuous representations for diverse modalities, while binary representation learning compresses these representations into binary codes to enhance retrieval speed through hash functions.
Real-valued Representation Learning
Real-valued representation learning approaches are further divided into unsupervised, pairwise-based, rank-based, and supervised methods. Each category exploits a different degree of side information, ranging from simple co-occurrence in unsupervised methods to label-driven constraints in supervised techniques, with stronger supervision generally yielding more discriminative subspaces.
- Unsupervised Methods: These methods rely on co-occurrence data to model latent common spaces. Representative techniques include Canonical Correlation Analysis (CCA) and its variants such as Deep CCA (DCCA).
- Pairwise-Based Methods: These methods exploit similar/dissimilar pair information to learn inter-modal metrics that better align related data from different modalities.
- Rank-Based Methods: These approaches cast retrieval as a ranking problem, optimizing the ordering of candidate matches so that it agrees with semantic relevance.
- Supervised Methods: By using class labels, these methods enhance the separability and discriminative power of the learned subspaces. Techniques such as Generalized Multiview Analysis (GMA) and Cross-modal Discriminant Feature Extraction (CDFE) exemplify this category.
Binary Representation Learning
Hashing methods provide efficient retrieval through binary coding schemes. The surveyed work categorizes these methods into linear and nonlinear function modeling, further divided by the nature of supervision:
- Unsupervised Methods: Techniques like Collective Matrix Factorization Hashing (CMFH) use matrix factorization to infer shared binary codes from data across modalities.
- Pairwise-Based Methods: These methods, such as Co-Regularized Hashing (CRH), incorporate similar/dissimilar pair information to refine the binary code mappings across modalities.
- Supervised Methods: Incorporating label-driven supervision, approaches like Semantic Correlation Maximization (SCM) optimize hashing functions to directly reflect semantic similarities.
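The mechanics of Hamming-space retrieval can be sketched as follows. Note that the hash functions here are random linear projections followed by sign thresholding, a deliberately simple stand-in for the learned mappings of methods like CMFH, CRH, or SCM; all sizes are illustrative.

```python
# Sketch: map each modality to binary codes, then retrieve by Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
n_bits = 32

img_feats = rng.normal(size=(500, 64))
txt_feats = rng.normal(size=(500, 48))

# One projection per modality into a shared Hamming space (illustrative only;
# surveyed methods *learn* these projections from co-occurrence, pairs, or labels).
W_img = rng.normal(size=(64, n_bits))
W_txt = rng.normal(size=(48, n_bits))

img_codes = (img_feats @ W_img > 0).astype(np.uint8)  # 0/1 binary codes
txt_codes = (txt_feats @ W_txt > 0).astype(np.uint8)

def hamming_dist(query, codes):
    # XOR counts the bits on which the query and each database code differ.
    return np.count_nonzero(query ^ codes, axis=1)

# Retrieve text items closest to the first image query in Hamming space.
ranking = np.argsort(hamming_dist(img_codes[0], txt_codes))
```

The efficiency argument is visible here: comparing compact bit codes with XOR and popcount is far cheaper than distance computations over real-valued vectors, which is why hashing methods scale to large databases.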
Experimental Evaluation
The paper evaluates representative cross-modal retrieval methods on benchmark datasets such as Wiki and NUS-WIDE. Results show that supervised methods consistently outperform unsupervised ones, confirming the benefit of leveraging labels for semantic alignment. Among the binary representation learning methods, SePH performs best by effectively preserving semantic similarity in the Hamming space.
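Comparisons like these are typically reported as mean average precision (mAP). As a quick reference, here is a minimal implementation of average precision for one query; the labels and ranking below are synthetic examples, not results from the survey.

```python
# Sketch: average precision (AP) for a single query; mAP is the mean of AP
# over all queries.
import numpy as np

def average_precision(relevant, ranking):
    """AP for one query: `relevant` is a boolean array over database items,
    `ranking` lists database indices from best match to worst."""
    hits, sum_prec = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if relevant[idx]:
            hits += 1
            sum_prec += hits / rank   # precision at each relevant hit
    return sum_prec / max(relevant.sum(), 1)

relevant = np.array([True, False, True, False])
ranking = [0, 2, 1, 3]                       # both relevant items ranked first
print(average_precision(relevant, ranking))  # → 1.0
```

A method that pushes all semantically relevant items to the top of the ranking scores AP of 1.0 for that query, so mAP directly rewards the semantic alignment the surveyed methods aim for.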
Future Research Directions
The paper highlights several avenues for future research:
- Development of large-scale, multimodal datasets to facilitate comprehensive evaluation and training of algorithms.
- Methods to deal with limited and noisy annotations prevalent in real-world multimodal data.
- Scalability and efficiency improvements for handling large datasets.
- Increased application of deep learning techniques for improved feature representation and cross-modal correlations.
- Exploration of finer-level semantic correspondences across modalities, rather than coarse-level common spaces.
Conclusion
Cross-modal retrieval is positioned at the nexus of data integration and retrieval, with a significant impact on enhancing user access to vast multimodal repositories. The surveyed methods underscore a trajectory towards more nuanced, scalable, and semantically aligned retrieval systems. As the field continues to grow, these insights will be pivotal in shaping next-generation solutions.