UGACH: Cross-Modal Retrieval via GANs
- The paper introduces UGACH, integrating GANs with a k-NN correlation graph to learn unified Hamming representations, achieving up to +0.13 MAP improvement.
- It employs a generative model to sample 'hard' negatives and a discriminative triplet loss to enforce manifold consistency across modalities.
- Experimental results on NUS-WIDE and MIRFlickr show UGACH outperforms both unsupervised and supervised methods in cross-modal retrieval tasks.
Unsupervised Generative Adversarial Cross-modal Hashing (UGACH) is an unsupervised framework for cross-modal retrieval, mapping heterogeneous multimedia data—such as images and texts—into a unified Hamming space. UGACH leverages generative adversarial networks (GANs) to exploit the underlying manifold structure across different modalities, enabling efficient cross-modal hashing and retrieval without annotation. The approach combines generative adversarial learning with a correlation-graph-based manifold prior, yielding superior retrieval performance relative to both state-of-the-art unsupervised and supervised methods (Zhang et al., 2017).
1. Architectural Components
UGACH consists of two primary modules: a generative model ($G$) and a discriminative model ($D$), each operating on two parallel pathways—one for each modality (image and text). The feature extraction pipeline represents each modality’s data via high-dimensional descriptors (e.g., 4,096-D VGG features for images, 1,000-D Bag-of-Words for text). Both pathways employ a common-representation layer

$$c = f(W_c x + b_c),$$

followed by a hashing layer (continuous relaxation)

$$h = \sigma(W_h c + b_h) \in (0,1)^K,$$

where $\sigma$ is the sigmoid function. Binary hash codes for retrieval are obtained at inference as

$$b = \operatorname{sgn}(h - 0.5) \in \{-1, +1\}^K.$$
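A minimal NumPy sketch of one pathway under this relaxation (the layer widths, the tanh activation of the common-representation layer, and the random initialization are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pathway(x, W_c, b_c, W_h, b_h):
    """One modality pathway: common representation -> relaxed hash codes."""
    c = np.tanh(x @ W_c + b_c)                    # common-representation layer (activation assumed)
    h = 1.0 / (1.0 + np.exp(-(c @ W_h + b_h)))    # hashing layer: sigmoid relaxation in (0, 1)
    return h

def binarize(h):
    """Inference-time binary codes: threshold the relaxed codes at 0.5."""
    return np.where(h >= 0.5, 1, -1)

# Illustrative dimensions: 4096-D image features, 512-D common space, 64-bit codes.
d_img, d_common, n_bits = 4096, 512, 64
W_c = rng.normal(scale=0.01, size=(d_img, d_common)); b_c = np.zeros(d_common)
W_h = rng.normal(scale=0.01, size=(d_common, n_bits)); b_h = np.zeros(n_bits)

x_img = rng.normal(size=(2, d_img))               # small batch of image descriptors
h_img = pathway(x_img, W_c, b_c, W_h, b_h)        # relaxed codes used during training
b_img = binarize(h_img)                           # discrete codes used for retrieval
```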
Given a query $q$ in one modality, the generative model uses the hashing outputs to define a discrete probability distribution for selecting a cross-modal sample $m$:

$$p_\theta(m \mid q) = \frac{\exp\bigl(-\lVert h_q - h_m \rVert_2^2\bigr)}{\sum_{m'} \exp\bigl(-\lVert h_q - h_{m'} \rVert_2^2\bigr)}.$$
This mechanism highlights “hard” negatives—cross-modal samples plausible under the (unknown) manifold structure.
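A sketch of this sampling step, assuming the softmax-over-negative-squared-distances form reconstructed above (the absence of a temperature parameter is an assumption):

```python
import numpy as np

def generator_distribution(h_query, h_candidates):
    """p_theta(m | q): softmax over negative squared distances between relaxed codes."""
    d = np.sum((h_candidates - h_query) ** 2, axis=1)   # squared Euclidean distances
    logits = -d
    logits -= logits.max()                              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_cross_modal(h_query, h_candidates, rng):
    """Draw one candidate index m ~ p_theta(m | q)."""
    p = generator_distribution(h_query, h_candidates)
    return rng.choice(len(p), p=p)
```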
The discriminative model learns to score query–candidate pairs by a triplet-based function

$$f_\phi(q, m_T, m_G) = \max\bigl(0,\; \alpha + d(q, m_T) - d(q, m_G)\bigr),$$

where $m_T$ denotes a “true” manifold neighbor (see Section 3), $m_G \sim p_\theta(m \mid q)$ is a sample drawn by the generative model, $\alpha$ is a margin, and $d(\cdot,\cdot)$ is the distance surrogate of Section 4. The discrimination is finalized with a sigmoid, $D(m \mid q) = \sigma\bigl(-d(q, m)\bigr)$, promoting proximity in Hamming space for true manifold pairs and discrimination against $G$’s samples.
2. Minimax Adversarial Objective
UGACH is trained by a minimax adversarial game. Let $p_{\mathrm{true}}(m \mid q)$ denote the empirical neighbor-based manifold distribution. The value function is

$$\min_\theta \max_\phi V(G, D) = \sum_{i=1}^{n} \Bigl( \mathbb{E}_{m \sim p_{\mathrm{true}}(m \mid q_i)}\bigl[\log D(m \mid q_i)\bigr] + \mathbb{E}_{m \sim p_\theta(m \mid q_i)}\bigl[\log\bigl(1 - D(m \mid q_i)\bigr)\bigr] \Bigr).$$

The objectives for $D$ and $G$ are:
- Discriminator loss: $\mathcal{L}_D = -\sum_i \bigl( \mathbb{E}_{m \sim p_{\mathrm{true}}(m \mid q_i)}[\log D(m \mid q_i)] + \mathbb{E}_{m \sim p_\theta(m \mid q_i)}[\log(1 - D(m \mid q_i))] \bigr)$, minimized over the discriminator parameters $\phi$;
- Generator loss: $\mathcal{L}_G = \sum_i \mathbb{E}_{m \sim p_\theta(m \mid q_i)}[\log(1 - D(m \mid q_i))]$, minimized over the generator parameters $\theta$.
The adversarial training thus alternates between maximizing $V(G, D)$ over $\phi$ with $\theta$ fixed and minimizing it over $\theta$ with $\phi$ fixed, consistent with standard GAN formulations but adapted to the cross-modal and hashing context.
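A sketch of how the two per-query losses could be evaluated from discriminator scores; the sigmoid-of-negative-distance scoring is the reconstruction used above, not a quotation of the paper:

```python
import numpy as np

def discriminator_score(h_q, h_m):
    """Assumed discriminator score D(m | q) = sigma(-||h_q - h_m||^2)."""
    d = np.sum((h_q - h_m) ** 2)
    return 1.0 / (1.0 + np.exp(d))   # sigma(-d)

def gan_losses(h_q, h_true, h_gen, eps=1e-12):
    """Per-query discriminator and generator losses from the minimax value function."""
    d_true = discriminator_score(h_q, h_true)   # pair drawn from p_true(m | q)
    d_gen = discriminator_score(h_q, h_gen)     # pair drawn from p_theta(m | q)
    loss_D = -(np.log(d_true + eps) + np.log(1.0 - d_gen + eps))  # D maximizes V
    loss_G = np.log(1.0 - d_gen + eps)                            # G minimizes V
    return loss_D, loss_G
```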
3. Correlation Graph and Manifold Distribution
UGACH explicitly captures cross-modal manifold structure through correlation graphs. For each modality $x$, a $k$-nearest-neighbor (k-NN) graph is constructed,

$$G_x = (V_x, E_x),$$

with $V_x$ indexing the data instances and binary adjacency matrices $A_x \in \{0,1\}^{n \times n}$ indicating neighbor relationships:

$$A_x(i, j) = \begin{cases} 1, & \text{if } x_j \text{ is among the } k \text{ nearest neighbors of } x_i,\\ 0, & \text{otherwise.} \end{cases}$$

A query $q$’s manifold neighbor distribution is

$$p_{\mathrm{true}}(m \mid q) = \frac{A(q, m)}{\sum_{m'} A(q, m')},$$

i.e., uniform over the $k$ nearest neighbors of $q$.
This structure induces manifold-consistent cross-modal pairs via paired alignments in the dataset.
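A sketch of the per-modality k-NN graph and the induced neighbor distribution; the use of cosine similarity over the raw descriptors is an assumption, since the text above does not restate the paper's similarity measure:

```python
import numpy as np

def knn_graph(features, k):
    """Binary adjacency: A[i, j] = 1 iff j is among the k nearest neighbors of i."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T                                     # cosine similarity (assumed)
    np.fill_diagonal(sim, -np.inf)                    # exclude self-neighbors
    nn = np.argsort(-sim, axis=1)[:, :k]              # indices of the k most similar items
    A = np.zeros(sim.shape, dtype=np.int8)
    np.put_along_axis(A, nn, 1, axis=1)
    return A

def manifold_neighbor_distribution(A, q):
    """p_true(m | q): uniform over the k neighbors of query index q."""
    row = A[q].astype(float)
    return row / row.sum()
```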
4. Hash Function Optimization and Binary-Code Learning
Binary hash codes are produced from the continuous outputs by

$$b = \operatorname{sgn}(h - 0.5) \in \{-1, +1\}^K.$$

During training, a squared Euclidean distance surrogate is used for Hamming distance:

$$d(q, m) = \lVert h_q - h_m \rVert_2^2.$$

The triplet regularization in $D$,

$$\mathcal{L}_{\mathrm{tri}} = \max\bigl(0,\; \alpha + d(q, m_T) - d(q, m_G)\bigr),$$
enforces that “true” manifold pairs are mapped closer in code space than negative or generated pairs. Discrete codes are used at test time for fast retrieval.
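A short sketch of the distance surrogate and the triplet term as reconstructed above (the margin value is a placeholder, not the paper's setting):

```python
import numpy as np

def relaxed_hamming(h_a, h_b):
    """Training-time surrogate for Hamming distance: squared Euclidean distance."""
    return np.sum((h_a - h_b) ** 2)

def triplet_loss(h_q, h_true, h_gen, margin=1.0):
    """Push true manifold pairs closer to the query than generated pairs by `margin`."""
    return max(0.0, margin + relaxed_hamming(h_q, h_true) - relaxed_hamming(h_q, h_gen))
```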
5. Training Algorithm
Training proceeds by alternating updates:
- Initialize model parameters $\theta$ (generative model) and $\phi$ (discriminative model).
- For each epoch:
- With $\theta$ fixed, update $\phi$ using stochastic gradient ascent on the objective combining true manifold and generated pairs.
- With $\phi$ fixed, update $\theta$ using REINFORCE for discrete sampling (a toy sketch of this policy-gradient step follows the list): $\nabla_\theta \mathcal{L}_G = \mathbb{E}_{m \sim p_\theta(m \mid q)}\bigl[\nabla_\theta \log p_\theta(m \mid q)\,\log\bigl(1 - D(m \mid q)\bigr)\bigr]$.
- The term $\log\bigl(1 - D(m \mid q)\bigr)$ acts as a reward signal for generator updates.
- Learning rates are annealed by a factor of 10 every two epochs.
- After convergence, the generative model $G$ is discarded; the discriminative model $D$’s hashing layers are retained for code generation.
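A toy sketch of the generator's REINFORCE update referenced in the list above. The discriminator is held fixed (its update is the standard gradient step omitted here), the generator is reduced to a flat vector of per-candidate logits so the score-function estimator stays visible, and all dimensions and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: precomputed relaxed codes for one query and a cross-modal candidate set.
n_cand, n_bits = 50, 16
h_q = rng.random(n_bits)
h_cand = rng.random((n_cand, n_bits))

def d_score(h_a, h_b):
    """Assumed discriminator score D(m | q) = sigma(-||h_a - h_b||^2), held fixed."""
    return 1.0 / (1.0 + np.exp(np.sum((h_a - h_b) ** 2)))

# Toy generator: p_theta(m | q) = softmax(theta); in UGACH theta parameterizes the
# hashing network, but a flat logit vector suffices to illustrate the update.
theta = np.zeros(n_cand)
lr = 0.1

for step in range(200):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    m = rng.choice(n_cand, p=p)                     # discrete sample m ~ p_theta(m | q)

    # REINFORCE / score-function estimator for E_{m ~ p_theta}[log(1 - D(m | q))],
    # minimized by gradient descent; log(1 - D) plays the role of the reward.
    reward = np.log(1.0 - d_score(h_q, h_cand[m]) + 1e-12)
    grad_log_p = -p
    grad_log_p[m] += 1.0                            # gradient of log softmax at index m
    theta -= lr * grad_log_p * reward
```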
6. Experimental Protocol and Empirical Results
UGACH was evaluated on NUS-WIDE (186,557 image–tag pairs, 10 major concepts, 1% query holdout) and MIRFlickr (25,000 pairs, 5% holdout). Image features used 4,096-D VGG19; text used 1,000-D BoW. Competitors included state-of-the-art unsupervised methods (CVH, PDH, CMFH, CCQ), supervised baselines (CMSSH, SCM), and ablations (triplet only; GAN without graph).
Standard retrieval metrics were employed: mean average precision (MAP) at multiple code lengths, precision-recall curves, and Precision@K (at 128 bits).
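For reference, a minimal sketch of MAP over Hamming-ranked retrieval lists; binary relevance labels and ranking over the full database are assumptions about the standard protocol, which the text above does not spell out:

```python
import numpy as np

def average_precision(relevant, ranking):
    """AP for one query: `relevant` is a boolean array over the database,
    `ranking` lists database indices sorted by increasing Hamming distance."""
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(hits, 1)

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP over all queries; codes are {-1, +1} arrays, relevance is bool [n_q, n_db]."""
    n_bits = query_codes.shape[1]
    aps = []
    for q, rel in zip(query_codes, relevance):
        hamming = (n_bits - db_codes @ q) / 2      # Hamming distance from {-1, +1} codes
        ranking = np.argsort(hamming)
        aps.append(average_precision(rel, ranking))
    return float(np.mean(aps))
```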
Key results for NUS-WIDE:
- UGACH: MAP(image→text) ≈ 0.624, MAP(text→image) ≈ 0.625
- Best prior unsupervised: CCQ (0.505 / 0.494); gains of +0.119 / +0.131 MAP
- Best supervised: SCM_seq (0.517 / 0.516); gains of +0.107 / +0.109 MAP
- Ablations indicated progressive gains: GAN adds +0.04, graph adds +0.02 MAP.
Key results for MIRFlickr:
- UGACH: MAP(image→text) ≈ 0.696, MAP(text→image) ≈ 0.681
- Prior unsupervised: CMFH/CCQ (0.663 / 0.639); gains +0.033 / +0.058
- Outperformed all supervised baselines on most metrics.
Qualitative analyses showed UGACH consistently producing higher Precision@K and improved PR curves in both cross-modal retrieval directions (Zhang et al., 2017).
7. Conceptual Synthesis and Significance
UGACH integrates a graph-based manifold prior (via k-NN adjacency) with a cross-modal generative adversarial network, where the generator selects “hard” negative examples respecting the empirical manifold structure and the discriminator employs triplet ranking to enforce discriminability in the learned Hamming space. The approach is entirely unsupervised and does not require label annotations. Experimental evidence demonstrates significant MAP improvements over both unsupervised and supervised alternatives on standard benchmarks, with consistent superiority in quantitative and qualitative retrieval measures (Zhang et al., 2017). This suggests robust potential for unsupervised cross-modal retrieval settings where annotation is impractical.