
Unsupervised Generative Adversarial Cross-modal Hashing (1712.00358v1)

Published 1 Dec 2017 in cs.CV

Abstract: Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space, which can realize fast and flexible retrieval across different modalities. Unsupervised cross-modal hashing is more flexible and applicable than supervised methods, since no intensive labeling work is involved. However, existing unsupervised methods learn hashing functions by preserving inter and intra correlations, while ignoring the underlying manifold structure across different modalities, which is extremely helpful to capture meaningful nearest neighbors of different modalities for cross-modal retrieval. To address the above problem, in this paper we propose an Unsupervised Generative Adversarial Cross-modal Hashing approach (UGACH), which makes full use of GAN's ability for unsupervised representation learning to exploit the underlying manifold structure of cross-modal data. The main contributions can be summarized as follows: (1) We propose a generative adversarial network to model cross-modal hashing in an unsupervised fashion. In the proposed UGACH, given a data of one modality, the generative model tries to fit the distribution over the manifold structure, and select informative data of another modality to challenge the discriminative model. The discriminative model learns to distinguish the generated data and the true positive data sampled from correlation graph to achieve better retrieval accuracy. These two models are trained in an adversarial way to improve each other and promote hashing function learning. (2) We propose a correlation graph based approach to capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have smaller Hamming distance and promote retrieval accuracy. Extensive experiments compared with 6 state-of-the-art methods verify the effectiveness of our proposed approach.

Authors (3)
  1. Jian Zhang (543 papers)
  2. Yuxin Peng (65 papers)
  3. Mingkuan Yuan (7 papers)
Citations (191)

Summary

Unsupervised Generative Adversarial Cross-modal Hashing

The paper "Unsupervised Generative Adversarial Cross-modal Hashing" presents an approach to cross-modal hashing that leverages the generative adversarial network (GAN) paradigm to address the challenges of unsupervised learning from heterogeneous multimedia data. Hashing methods are valued for efficient retrieval from large multimedia databases because they transform high-dimensional representations into compact binary codes in a common Hamming space, where Hamming distances can be computed rapidly even over very large collections.
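To make this concrete, here is a minimal sketch of hash-based cross-modal retrieval. It is not the paper's code: the random linear projections stand in for learned hashing functions, and the feature dimensions and code length are arbitrary assumptions.

```python
import numpy as np

def to_binary_codes(features, projection):
    """Hash real-valued features to {0, 1} codes via the sign of a linear projection."""
    return (features @ projection > 0).astype(np.uint8)

def hamming_rank(query_code, database_codes):
    """Rank database items by Hamming distance to the query code."""
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(distances), distances

# Toy example: 128-d image/text features hashed to 32-bit codes.
rng = np.random.default_rng(0)
proj_img, proj_txt = rng.normal(size=(128, 32)), rng.normal(size=(128, 32))
image_db = to_binary_codes(rng.normal(size=(1000, 128)), proj_img)
text_query = to_binary_codes(rng.normal(size=(1, 128)), proj_txt)

order, dists = hamming_rank(text_query[0], image_db)
print(order[:10], dists[order[:10]])
```

Once both modalities share a Hamming space, a text query retrieves images (and vice versa) purely by bit comparisons, which is what makes compact codes attractive at scale.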

Methodology Overview

The primary contribution of this research is the UGACH framework, which combines GAN training with hashing to uncover and exploit the manifold structure of multimedia data without labeled training data. UGACH comprises a generative model and a discriminative model that operate adversarially. Given a query from one modality, the generative model fits the distribution over the manifold structure and selects informative data of the other modality to challenge the discriminative model. The discriminative model, in turn, learns to distinguish these generated pairs from true positive pairs sampled from a correlation graph; the two models are trained against each other so that both improve and drive the learning of robust hashing functions.
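The sketch below illustrates one plausible realization of this adversarial game; it is not the paper's exact losses or architecture. The `Hasher` networks, the margin value, and the REINFORCE-style generator update are illustrative assumptions: because the generator's candidate selection is discrete, a policy-gradient update using the discriminator's score as reward is one common workaround.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hasher(nn.Module):
    """Illustrative hashing network: maps modality features to relaxed (tanh) codes."""
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, code_len), nn.Tanh())

    def forward(self, x):
        return self.net(x)

def pair_score(a, b):
    # Inner product of relaxed codes: higher when the codes agree bit-wise.
    return (a * b).sum(dim=-1)

img_dim, txt_dim, code_len, margin = 4096, 1000, 32, 1.0
d_img, d_txt = Hasher(img_dim, code_len), Hasher(txt_dim, code_len)  # discriminator hashers
g_img, g_txt = Hasher(img_dim, code_len), Hasher(txt_dim, code_len)  # generator hashers
opt_d = torch.optim.Adam(list(d_img.parameters()) + list(d_txt.parameters()), lr=1e-4)
opt_g = torch.optim.Adam(list(g_img.parameters()) + list(g_txt.parameters()), lr=1e-4)

def train_step(img_query, txt_candidates, txt_positive):
    """One image-to-text adversarial step (the text-to-image direction is symmetric)."""
    B, K, _ = txt_candidates.shape

    # Generator scores every candidate text and samples one to challenge the discriminator.
    q = g_img(img_query)                                            # (B, code_len)
    c = g_txt(txt_candidates.reshape(B * K, -1)).view(B, K, -1)     # (B, K, code_len)
    probs = F.softmax(pair_score(q.unsqueeze(1), c), dim=1)         # (B, K)
    picked = torch.multinomial(probs, 1).squeeze(1)                 # discrete choice per query
    chosen = txt_candidates[torch.arange(B), picked]                # (B, txt_dim)

    # Discriminator: rank correlation-graph positives above generated candidates (margin loss).
    s_pos = pair_score(d_img(img_query), d_txt(txt_positive))
    s_gen = pair_score(d_img(img_query), d_txt(chosen))
    d_loss = F.relu(margin - s_pos + s_gen).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: REINFORCE-style update, using the discriminator's score as reward.
    reward = pair_score(d_img(img_query), d_txt(chosen)).detach()
    g_loss = -(torch.log(probs[torch.arange(B), picked] + 1e-8) * reward).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy call: batch of 8 image queries, 20 candidate texts each, one graph positive per query.
losses = train_step(torch.randn(8, img_dim), torch.randn(8, 20, txt_dim), torch.randn(8, txt_dim))
```

The key dynamic is the one the summary describes: the generator keeps proposing harder cross-modal candidates, and the discriminator keeps refining its hash codes to separate graph-derived positives from those proposals.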

A notable component of UGACH is its correlation graph, which captures the manifold structure across modalities so that data of different modalities lying on the same manifold are encouraged to have small Hamming distances. The graph supplies the discriminative model with manifold-based correlation guidance, which significantly boosts retrieval accuracy.
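As a rough illustration (not the paper's exact construction), a k-nearest-neighbour correlation graph can be built per modality from off-the-shelf features, with co-occurring image-text pairs linking the two graphs. The `knn_graph` helper, the cosine-similarity choice, and the way positives are merged below are assumptions made for the sketch.

```python
import numpy as np

def knn_graph(features, k):
    """Boolean adjacency: row i is True at i's k nearest neighbours under cosine similarity."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)                     # exclude self-links
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros(sim.shape, dtype=bool)
    rows = np.repeat(np.arange(sim.shape[0]), k)
    adj[rows, neighbours.ravel()] = True
    return adj

# Toy data: 500 co-occurring image/text feature pairs (index i pairs image i with text i).
rng = np.random.default_rng(1)
img_feat, txt_feat = rng.normal(size=(500, 4096)), rng.normal(size=(500, 1000))
img_adj, txt_adj = knn_graph(img_feat, k=10), knn_graph(txt_feat, k=10)

# Cross-modal positives for item i: items that are neighbours in either modality's graph.
cross_positives = img_adj | txt_adj
```

Sampling "true positive" pairs from such a graph is what lets the discriminator receive manifold-aware supervision even though no class labels are available.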

Experimental Validation

Empirical evaluation on two widely used datasets, NUS-WIDE and MIRFlickr, shows UGACH outperforming six state-of-the-art methods, covering both unsupervised (e.g., CVH, PDH) and supervised (e.g., SCM) approaches. UGACH consistently delivers higher Mean Average Precision (MAP) scores across varying code lengths in both image-to-text and text-to-image retrieval, evidence of the effectiveness of its manifold-guided representation learning. UGACH also outperforms its baseline configurations, demonstrating the clear improvements contributed by the adversarial training and correlation graph components.
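For reference, the MAP metric used in such comparisons can be computed as in the sketch below. This is an illustration of standard Hamming-ranking MAP, not the authors' evaluation code; whether the ranking is truncated to a top-R list is an evaluation choice that varies across papers.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query given a 0/1 relevance vector in ranked order."""
    hits = np.cumsum(ranked_relevance)
    if hits[-1] == 0:
        return 0.0
    precisions = hits / (np.arange(len(ranked_relevance)) + 1)
    return float((precisions * ranked_relevance).sum() / hits[-1])

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP over binary codes; relevance[i, j] = 1 if db item j is relevant to query i."""
    aps = []
    for q, rel in zip(query_codes, relevance):
        dists = np.count_nonzero(db_codes != q, axis=1)   # Hamming distances
        order = np.argsort(dists, kind="stable")
        aps.append(average_precision(rel[order]))
    return float(np.mean(aps))

# Toy example with random 32-bit codes and random relevance labels.
rng = np.random.default_rng(2)
queries = rng.integers(0, 2, size=(50, 32), dtype=np.uint8)
database = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)
labels = rng.integers(0, 2, size=(50, 1000))
print("MAP:", mean_average_precision(queries, database, labels))
```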

Implications and Future Directions

The results of UGACH hold substantial promise for multimedia information retrieval systems, especially where labeled data is scarce or unavailable. By exploiting manifold structure through an unsupervised GAN, UGACH points to practical applications in dynamic multimodal environments, ranging from image tagging to cross-modal content recommendation services.

Future explorations may include expanding UGACH's capabilities to support multi-modal retrieval encompassing video, audio, and other data types. Additionally, adapting the framework for tasks like image caption generation could further showcase its flexibility and applicability.

The systematic approach to capturing unsupervised cross-modal correlations positions UGACH as a pivotal step towards efficient and scalable retrieval systems, driving further experimentation and innovation in AI-based multimedia retrieval research.