UGACH: Cross-Modal Retrieval via GANs
- The paper introduces UGACH, integrating GANs with a k-NN correlation graph to learn unified Hamming representations, achieving up to +0.13 MAP improvement.
- It employs a generative model to sample 'hard' negatives and a discriminative triplet loss to enforce manifold consistency across modalities.
- Experimental results on NUS-WIDE and MIRFlickr show UGACH outperforms both unsupervised and supervised methods in cross-modal retrieval tasks.
Unsupervised Generative Adversarial Cross-modal Hashing (UGACH) is an unsupervised framework for cross-modal retrieval, mapping heterogeneous multimedia data—such as images and texts—into a unified Hamming space. UGACH leverages generative adversarial networks (GANs) to exploit the underlying manifold structure across different modalities, enabling efficient cross-modal hashing and retrieval without annotation. The approach combines generative adversarial learning with a correlation-graph-based manifold prior, yielding superior retrieval performance relative to both state-of-the-art unsupervised and supervised methods (Zhang et al., 2017).
1. Architectural Components
UGACH consists of two primary modules: a generative model ($G$) and a discriminative model ($D$), each operating on two parallel pathways—one for each modality (image and text). The feature extraction pipeline represents each modality’s data via high-dimensional descriptors (e.g., 4,096-D VGG features for images, 1,000-D Bag-of-Words for text). Both pathways employ a common-representation layer

$$c = f(W_c x + b_c),$$

followed by a hashing layer (continuous relaxation)

$$h = \sigma(W_h c + b_h) \in (0,1)^K,$$

where $\sigma$ is the sigmoid function. Binary hash codes for retrieval are obtained at inference as

$$b = \operatorname{sgn}(h - 0.5) \in \{-1, +1\}^K.$$
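A minimal NumPy sketch of one pathway under this relaxation (the layer widths, the tanh activation of the common-representation layer, and the random initialization are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pathway(x, W_c, b_c, W_h, b_h):
    """One modality pathway: common representation -> relaxed hash codes."""
    c = np.tanh(x @ W_c + b_c)                    # common-representation layer (activation assumed)
    h = 1.0 / (1.0 + np.exp(-(c @ W_h + b_h)))    # hashing layer: sigmoid relaxation in (0, 1)
    return h

def binarize(h):
    """Inference-time binary codes: threshold the relaxed codes at 0.5."""
    return np.where(h >= 0.5, 1, -1)

# Illustrative dimensions: 4096-D image features, 512-D common space, 64-bit codes.
d_img, d_common, n_bits = 4096, 512, 64
W_c = rng.normal(scale=0.01, size=(d_img, d_common)); b_c = np.zeros(d_common)
W_h = rng.normal(scale=0.01, size=(d_common, n_bits)); b_h = np.zeros(n_bits)

x_img = rng.normal(size=(2, d_img))               # small batch of image descriptors
h_img = pathway(x_img, W_c, b_c, W_h, b_h)        # relaxed codes used during training
b_img = binarize(h_img)                           # discrete codes used for retrieval
```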
Given a query $q$ in one modality, the generative model uses the hashing outputs to define a discrete probability distribution for selecting a cross-modal sample $m$:

$$p_\theta(m \mid q) = \frac{\exp\bigl(-\lVert h_q - h_m \rVert_2^2\bigr)}{\sum_{m'} \exp\bigl(-\lVert h_q - h_{m'} \rVert_2^2\bigr)}.$$
This mechanism highlights “hard” negatives—cross-modal samples plausible under the (unknown) manifold structure.
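A sketch of this sampling step, assuming the softmax-over-negative-squared-distances form reconstructed above (the absence of a temperature parameter is an assumption):

```python
import numpy as np

def generator_distribution(h_query, h_candidates):
    """p_theta(m | q): softmax over negative squared distances between relaxed codes."""
    d = np.sum((h_candidates - h_query) ** 2, axis=1)   # squared Euclidean distances
    logits = -d
    logits -= logits.max()                              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_cross_modal(h_query, h_candidates, rng):
    """Draw one candidate index m ~ p_theta(m | q)."""
    p = generator_distribution(h_query, h_candidates)
    return rng.choice(len(p), p=p)
```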
The discriminative model learns to score query–candidate pairs by a triplet-based function

$$f_\phi(q, m_T, m_G) = \max\bigl(0,\; \alpha + d(q, m_T) - d(q, m_G)\bigr),$$

where $m_T$ denotes a “true” manifold neighbor (see Section 3), $m_G \sim p_\theta(m \mid q)$ is a sample drawn by the generative model, $\alpha$ is a margin, and $d(\cdot,\cdot)$ is the distance surrogate of Section 4. The discrimination is finalized with a sigmoid, $D(m \mid q) = \sigma\bigl(-d(q, m)\bigr)$, promoting proximity in Hamming space for true manifold pairs and discrimination against $G$’s samples.
2. Minimax Adversarial Objective
UGACH is trained by a minimax adversarial game. Let $p_{\mathrm{true}}(m \mid q)$ denote the empirical neighbor-based manifold distribution. The value function is

$$\min_\theta \max_\phi V(G, D) = \sum_{i=1}^{n} \Bigl( \mathbb{E}_{m \sim p_{\mathrm{true}}(m \mid q_i)}\bigl[\log D(m \mid q_i)\bigr] + \mathbb{E}_{m \sim p_\theta(m \mid q_i)}\bigl[\log\bigl(1 - D(m \mid q_i)\bigr)\bigr] \Bigr).$$

The objectives for $D$ and $G$ are:
- Discriminator loss: $\mathcal{L}_D = -\sum_i \bigl( \mathbb{E}_{m \sim p_{\mathrm{true}}(m \mid q_i)}[\log D(m \mid q_i)] + \mathbb{E}_{m \sim p_\theta(m \mid q_i)}[\log(1 - D(m \mid q_i))] \bigr)$, minimized over the discriminator parameters $\phi$;
- Generator loss: $\mathcal{L}_G = \sum_i \mathbb{E}_{m \sim p_\theta(m \mid q_i)}[\log(1 - D(m \mid q_i))]$, minimized over the generator parameters $\theta$.
The adversarial training thus alternates between maximizing $V(G, D)$ over $\phi$ with $\theta$ fixed and minimizing it over $\theta$ with $\phi$ fixed, consistent with standard GAN formulations but adapted to the cross-modal and hashing context.
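A sketch of how the two per-query losses could be evaluated from discriminator scores; the sigmoid-of-negative-distance scoring is the reconstruction used above, not a quotation of the paper:

```python
import numpy as np

def discriminator_score(h_q, h_m):
    """Assumed discriminator score D(m | q) = sigma(-||h_q - h_m||^2)."""
    d = np.sum((h_q - h_m) ** 2)
    return 1.0 / (1.0 + np.exp(d))   # sigma(-d)

def gan_losses(h_q, h_true, h_gen, eps=1e-12):
    """Per-query discriminator and generator losses from the minimax value function."""
    d_true = discriminator_score(h_q, h_true)   # pair drawn from p_true(m | q)
    d_gen = discriminator_score(h_q, h_gen)     # pair drawn from p_theta(m | q)
    loss_D = -(np.log(d_true + eps) + np.log(1.0 - d_gen + eps))  # D maximizes V
    loss_G = np.log(1.0 - d_gen + eps)                            # G minimizes V
    return loss_D, loss_G
```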
3. Correlation Graph and Manifold Distribution
UGACH explicitly captures cross-modal manifold structure through correlation graphs. For each modality $x$, a $k$-nearest-neighbor (k-NN) graph is constructed,

$$G_x = (V_x, E_x),$$

with $V_x$ indexing the data instances and binary adjacency matrices $A_x \in \{0,1\}^{n \times n}$ indicating neighbor relationships:

$$A_x(i, j) = \begin{cases} 1, & \text{if } x_j \text{ is among the } k \text{ nearest neighbors of } x_i,\\ 0, & \text{otherwise.} \end{cases}$$

A query $q$’s manifold neighbor distribution is

$$p_{\mathrm{true}}(m \mid q) = \frac{A(q, m)}{\sum_{m'} A(q, m')},$$

i.e., uniform over the $k$ nearest neighbors of $q$.
This structure induces manifold-consistent cross-modal pairs via paired alignments in the dataset.
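A sketch of the per-modality k-NN graph and the induced neighbor distribution; the use of cosine similarity over the raw descriptors is an assumption, since the text above does not restate the paper's similarity measure:

```python
import numpy as np

def knn_graph(features, k):
    """Binary adjacency: A[i, j] = 1 iff j is among the k nearest neighbors of i."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T                                     # cosine similarity (assumed)
    np.fill_diagonal(sim, -np.inf)                    # exclude self-neighbors
    nn = np.argsort(-sim, axis=1)[:, :k]              # indices of the k most similar items
    A = np.zeros(sim.shape, dtype=np.int8)
    np.put_along_axis(A, nn, 1, axis=1)
    return A

def manifold_neighbor_distribution(A, q):
    """p_true(m | q): uniform over the k neighbors of query index q."""
    row = A[q].astype(float)
    return row / row.sum()
```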
4. Hash Function Optimization and Binary-Code Learning
Binary hash codes are produced from the continuous outputs by

$$b = \operatorname{sgn}(h - 0.5) \in \{-1, +1\}^K.$$

During training, a squared Euclidean distance surrogate is used for Hamming distance:

$$d(q, m) = \lVert h_q - h_m \rVert_2^2.$$

The triplet regularization in $D$,

$$\mathcal{L}_{\mathrm{tri}} = \max\bigl(0,\; \alpha + d(q, m_T) - d(q, m_G)\bigr),$$
enforces that “true” manifold pairs are mapped closer in code space than negative or generated pairs. Discrete codes are used at test time for fast retrieval.
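A short sketch of the distance surrogate and the triplet term as reconstructed above (the margin value is a placeholder, not the paper's setting):

```python
import numpy as np

def relaxed_hamming(h_a, h_b):
    """Training-time surrogate for Hamming distance: squared Euclidean distance."""
    return np.sum((h_a - h_b) ** 2)

def triplet_loss(h_q, h_true, h_gen, margin=1.0):
    """Push true manifold pairs closer to the query than generated pairs by `margin`."""
    return max(0.0, margin + relaxed_hamming(h_q, h_true) - relaxed_hamming(h_q, h_gen))
```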
5. Training Algorithm
Training proceeds by alternating updates:
- Initialize model parameters $\theta$ (generative model) and $\phi$ (discriminative model).
- For each epoch:
- With $\theta$ fixed, update $\phi$ using stochastic gradient ascent on the objective combining true manifold and generated pairs.
- With $\phi$ fixed, update $\theta$ using REINFORCE for discrete sampling (a toy sketch of this policy-gradient step follows the list): $\nabla_\theta \mathcal{L}_G = \mathbb{E}_{m \sim p_\theta(m \mid q)}\bigl[\nabla_\theta \log p_\theta(m \mid q)\,\log\bigl(1 - D(m \mid q)\bigr)\bigr]$.
- The term $\log\bigl(1 - D(m \mid q)\bigr)$ acts as a reward signal for generator updates.
- Learning rates are annealed by a factor of 10 every two epochs.
- After convergence, the generative model $G$ is discarded; the discriminative model $D$’s hashing layers are retained for code generation.
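A toy sketch of the generator's REINFORCE update referenced in the list above. The discriminator is held fixed (its update is the standard gradient step omitted here), the generator is reduced to a flat vector of per-candidate logits so the score-function estimator stays visible, and all dimensions and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: precomputed relaxed codes for one query and a cross-modal candidate set.
n_cand, n_bits = 50, 16
h_q = rng.random(n_bits)
h_cand = rng.random((n_cand, n_bits))

def d_score(h_a, h_b):
    """Assumed discriminator score D(m | q) = sigma(-||h_a - h_b||^2), held fixed."""
    return 1.0 / (1.0 + np.exp(np.sum((h_a - h_b) ** 2)))

# Toy generator: p_theta(m | q) = softmax(theta); in UGACH theta parameterizes the
# hashing network, but a flat logit vector suffices to illustrate the update.
theta = np.zeros(n_cand)
lr = 0.1

for step in range(200):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    m = rng.choice(n_cand, p=p)                     # discrete sample m ~ p_theta(m | q)

    # REINFORCE / score-function estimator for E_{m ~ p_theta}[log(1 - D(m | q))],
    # minimized by gradient descent; log(1 - D) plays the role of the reward.
    reward = np.log(1.0 - d_score(h_q, h_cand[m]) + 1e-12)
    grad_log_p = -p
    grad_log_p[m] += 1.0                            # gradient of log softmax at index m
    theta -= lr * grad_log_p * reward
```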
6. Experimental Protocol and Empirical Results
UGACH was evaluated on NUS-WIDE (186,557 image–tag pairs, 10 major concepts, 1% query holdout) and MIRFlickr (25,000 pairs, 5% holdout). Image features used 4,096-D VGG19; text used 1,000-D BoW. Competitors included state-of-the-art unsupervised methods (CVH, PDH, CMFH, CCQ), supervised baselines (CMSSH, SCM), and ablations (triplet only; GAN without graph).
Standard retrieval metrics were employed: mean average precision (MAP) at multiple code lengths, precision-recall curves, and Precision@K (at 128 bits).
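For reference, a minimal sketch of MAP over Hamming-ranked retrieval lists; binary relevance labels and ranking over the full database are assumptions about the standard protocol, which the text above does not spell out:

```python
import numpy as np

def average_precision(relevant, ranking):
    """AP for one query: `relevant` is a boolean array over the database,
    `ranking` lists database indices sorted by increasing Hamming distance."""
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(hits, 1)

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP over all queries; codes are {-1, +1} arrays, relevance is bool [n_q, n_db]."""
    n_bits = query_codes.shape[1]
    aps = []
    for q, rel in zip(query_codes, relevance):
        hamming = (n_bits - db_codes @ q) / 2      # Hamming distance from {-1, +1} codes
        ranking = np.argsort(hamming)
        aps.append(average_precision(rel, ranking))
    return float(np.mean(aps))
```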
Key results for NUS-WIDE:
- UGACH: MAP(image→text) ≈ 0.624, MAP(text→image) ≈ 0.625
- Best prior unsupervised: CCQ (0.505 / 0.494); gains of +0.119 / +0.131 MAP
- Best supervised: SCM_seq (0.517 / 0.516); gains of +0.107 / +0.109 MAP
- Ablations indicated progressive gains: GAN adds +0.04, graph adds +0.02 MAP.
Key results for MIRFlickr:
- UGACH: MAP(image→text) ≈ 0.696, MAP(text→image) ≈ 0.681
- Prior unsupervised: CMFH/CCQ (0.663 / 0.639); gains +0.033 / +0.058
- Outperformed all supervised baselines on most metrics.
Qualitative analyses showed UGACH consistently producing higher Precision@K and improved PR curves in both cross-modal retrieval directions (Zhang et al., 2017).
7. Conceptual Synthesis and Significance
UGACH integrates a graph-based manifold prior (via k-NN adjacency) with a cross-modal generative adversarial network, where the generator selects “hard” negative examples respecting the empirical manifold structure and the discriminator employs triplet ranking to enforce discriminability in the learned Hamming space. The approach is entirely unsupervised and does not require label annotations. Experimental evidence demonstrates significant MAP improvements over both unsupervised and supervised alternatives on standard benchmarks, with consistent superiority in quantitative and qualitative retrieval measures (Zhang et al., 2017). This suggests robust potential for unsupervised cross-modal retrieval settings where annotation is impractical.