UGACH: Cross-Modal Retrieval via GANs

Updated 19 November 2025
  • The paper introduces UGACH, integrating GANs with a k-NN correlation graph to learn unified Hamming representations, achieving up to +0.13 MAP improvement.
  • It employs a generative model to sample 'hard' negatives and a discriminative triplet loss to enforce manifold consistency across modalities.
  • Experimental results on NUS-WIDE and MIRFlickr show UGACH outperforms both unsupervised and supervised methods in cross-modal retrieval tasks.

Unsupervised Generative Adversarial Cross-modal Hashing (UGACH) is an unsupervised framework for cross-modal retrieval, mapping heterogeneous multimedia data—such as images and texts—into a unified Hamming space. UGACH leverages generative adversarial networks (GANs) to exploit the underlying manifold structure across different modalities, enabling efficient cross-modal hashing and retrieval without annotation. The approach combines generative adversarial learning with a correlation-graph-based manifold prior, yielding superior retrieval performance relative to both state-of-the-art unsupervised and supervised methods (Zhang et al., 2017).

1. Architectural Components

UGACH consists of two primary modules: a generative model (G) and a discriminative model (D), each operating on two parallel pathways—one for each modality (image and text). The feature extraction pipeline represents each modality’s data $x$ via high-dimensional descriptors (e.g., 4,096-D VGG features for images, 1,000-D Bag-of-Words for text). Both pathways employ a common-representation layer

\phi_c(x) = \tanh(W_c x + b_c),

followed by a hashing layer (continuous relaxation)

h(x) = \sigma(W_h \phi_c(x) + b_h),

where $\sigma$ is the sigmoid function. Binary hash codes for retrieval are obtained at inference as

b(x) = \operatorname{sgn}(h(x) - 0.5).
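
As a concrete illustration, the following minimal PyTorch sketch implements one such pathway; the layer sizes, code length, and class name `HashPathway` are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HashPathway(nn.Module):
    """One modality pathway: common-representation layer followed by a hashing layer."""
    def __init__(self, feat_dim, common_dim=1024, code_len=64):
        super().__init__()
        self.common = nn.Linear(feat_dim, common_dim)    # W_c, b_c
        self.hashing = nn.Linear(common_dim, code_len)   # W_h, b_h

    def forward(self, x):
        phi_c = torch.tanh(self.common(x))           # phi_c(x) = tanh(W_c x + b_c)
        return torch.sigmoid(self.hashing(phi_c))    # h(x) = sigma(W_h phi_c(x) + b_h)

def binarize(h):
    """Inference-time codes b(x) = sgn(h(x) - 0.5); values land in {-1, +1} (0.5 maps to 0)."""
    return torch.sign(h - 0.5)

# One pathway per modality: same architecture, separate weights.
image_net = HashPathway(feat_dim=4096)   # e.g. 4,096-D VGG image features
text_net = HashPathway(feat_dim=1000)    # e.g. 1,000-D Bag-of-Words text features
```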

Given a query $q$ in one modality, the generative model $G$ uses the hashing outputs to define a discrete probability distribution for selecting a cross-modal sample $x^G$:

p_\theta(x^G \mid q) = \frac{\exp(-\|h(q) - h(x^G)\|^2)}{\sum_{x'} \exp(-\|h(q) - h(x')\|^2)}.

This mechanism highlights “hard” negatives—cross-modal samples plausible under the (unknown) manifold structure.
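
A minimal sketch of this sampling step, assuming the relaxed hash codes of the query and the cross-modal candidates are already available as tensors (function names are illustrative):

```python
import torch

def generator_probs(h_query, h_candidates):
    """p_theta(x | q): softmax over negative squared distances between the query's
    relaxed hash code h(q) (shape [L]) and each candidate's code (shape [N, L])."""
    sq_dists = ((h_candidates - h_query.unsqueeze(0)) ** 2).sum(dim=1)
    return torch.softmax(-sq_dists, dim=0)

def sample_hard_negatives(h_query, h_candidates, num_samples=1):
    """Draw candidate indices from p_theta; candidates close to the query in code space
    (i.e. plausible under the manifold) are sampled more often."""
    probs = generator_probs(h_query, h_candidates)
    return torch.multinomial(probs, num_samples, replacement=True)
```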

The discriminative model $D$ learns to score query-candidate pairs by a triplet-based function

f_\phi(x, q) = \max(0,\, m + \|h(q) - h(x^M)\|^2 - \|h(q) - h(x)\|^2),

where $x^M$ denotes a “true” manifold neighbor (see Section 3), and $m = 1$. The discriminator output is the sigmoid $D(x \mid q) = \sigma(f_\phi(x, q))$, which promotes proximity in Hamming space for true manifold pairs and discriminates against $G$’s samples.
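
In code, the triplet score and sigmoid output might look like the sketch below; the margin default of 1.0 follows the text, while the function name and tensor shapes are assumptions:

```python
import torch

def discriminator_output(h_q, h_pos, h_x, margin=1.0):
    """Triplet score f_phi(x, q) = max(0, m + ||h(q)-h(x^M)||^2 - ||h(q)-h(x)||^2)
    followed by the sigmoid D(x|q) = sigma(f_phi(x, q))."""
    d_pos = ((h_q - h_pos) ** 2).sum(dim=-1)   # distance to the manifold neighbor x^M
    d_x = ((h_q - h_x) ** 2).sum(dim=-1)       # distance to the candidate x (e.g. generated)
    f = torch.clamp(margin + d_pos - d_x, min=0.0)
    return torch.sigmoid(f)
```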

2. Minimax Adversarial Objective

UGACH is trained by a minimax adversarial game. Let $p_{\mathrm{true}}(x^M \mid q)$ denote the empirical neighbor-based manifold distribution. The value function is

V(G, D) = \sum_{j=1}^{n} \left[ \mathbb{E}_{x \sim p_{\mathrm{true}}(\cdot \mid q^j)} [\log D(x \mid q^j)] + \mathbb{E}_{x \sim p_\theta(\cdot \mid q^j)} [\log(1 - D(x \mid q^j))] \right].

The objectives for $G$ and $D$ are:

  • Discriminator loss: $L_D = -V(G, D)$;
  • Generator loss: $L_G = \sum_j \mathbb{E}_{x \sim p_\theta(\cdot \mid q^j)} [\log(1 - D(x \mid q^j))]$.

The adversarial training thus alternates between

\min_G L_G \qquad \text{vs.} \qquad \min_D L_D,

consistent with standard GAN formulations but adapted to the cross-modal and hashing context.
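
A sketch of the two losses, assuming the discriminator outputs $D(x \mid q)$ have already been computed for true manifold pairs and for generated pairs (the small epsilon for numerical stability is an added assumption):

```python
import torch

def discriminator_loss(d_true, d_gen, eps=1e-8):
    """L_D = -V(G, D): push D(x|q) toward 1 on true manifold pairs, toward 0 on generated pairs."""
    return -(torch.log(d_true + eps).sum() + torch.log(1.0 - d_gen + eps).sum())

def generator_loss(d_gen, eps=1e-8):
    """L_G = sum_j E_{x ~ p_theta}[log(1 - D(x|q^j))], estimated from sampled generated pairs."""
    return torch.log(1.0 - d_gen + eps).sum()
```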

3. Correlation Graph and Manifold Distribution

UGACH explicitly captures cross-modal manifold structure through correlation graphs. For each modality, a $k$-nearest-neighbor (k-NN) graph is constructed:

\operatorname{Graph}_i = (V, W_i), \quad \operatorname{Graph}_t = (V, W_t),

with $V$ indexing the $n$ data instances and binary adjacency matrices indicating neighbor relationships:

w_i(p, q) = \begin{cases} 1 & \text{if } x_p \in \mathrm{NN}_k(x_q) \\ 0 & \text{otherwise}. \end{cases}

A query $q^j$’s manifold neighbor distribution is

p_{\mathrm{true}}(x^M \mid q^j) = \operatorname{Uniform}(\{x_k : w(k, j) = 1\}).

Because instances are paired across modalities in the dataset, neighbors found in one modality’s graph induce manifold-consistent cross-modal pairs.
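
The following NumPy sketch illustrates the k-NN adjacency construction and uniform neighbor sampling; the Euclidean distance metric and the value of k are illustrative assumptions:

```python
import numpy as np

def knn_adjacency(features, k=5):
    """Binary adjacency W with w(p, q) = 1 iff x_p is among the k nearest neighbors of x_q
    (Euclidean distance within one modality's feature space)."""
    sq_norms = (features ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dists, np.inf)               # exclude self-matches
    nn_idx = np.argsort(dists, axis=0)[:k, :]     # k nearest rows p for each column (query) q
    W = np.zeros(dists.shape, dtype=np.int8)
    for q in range(features.shape[0]):
        W[nn_idx[:, q], q] = 1
    return W

def sample_manifold_neighbor(W, q, rng=None):
    """Draw x^M uniformly from {x_k : w(k, q) = 1}, i.e. from p_true(. | q)."""
    rng = rng or np.random.default_rng()
    neighbors = np.flatnonzero(W[:, q])
    return int(rng.choice(neighbors))
```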

4. Hash Function Optimization and Binary-Code Learning

Binary hash codes are produced from the continuous outputs by

b(x) = \operatorname{sgn}(h(x) - 0.5), \qquad b(x) \in \{-1, +1\}^\ell.

During training, a squared Euclidean distance surrogate is used for Hamming distance:

d(q, x) = \|h(q) - h(x)\|^2.

The triplet regularization in $D$,

f_\phi(x, q) = \max(0, m + d(q, x^M) - d(q, x)),

enforces that “true” manifold pairs are mapped closer in code space than negative or generated pairs. Discrete codes are used at test time for fast retrieval.
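
A small sketch of test-time binarization and Hamming-distance ranking; storing the codes as {0, 1} bits rather than {-1, +1} is an implementation convenience assumed here:

```python
import numpy as np

def to_bits(h):
    """b(x) = sgn(h(x) - 0.5), stored here as {0, 1} bits for cheap Hamming comparisons."""
    return (h >= 0.5).astype(np.uint8)

def hamming_distances(query_bits, db_bits):
    """Hamming distance from one query code to every database code (count of differing bits)."""
    return (query_bits[None, :] != db_bits).sum(axis=1)

def retrieve(query_bits, db_bits, top_k=10):
    """Rank database items by ascending Hamming distance to the query."""
    return np.argsort(hamming_distances(query_bits, db_bits))[:top_k]
```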

5. Training Algorithm

Training proceeds by alternating updates:

  1. Initialize model parameters $(\theta, \phi)$.
  2. For each epoch:

    • With $G$ fixed, update $D$ by stochastic gradient descent on $L_D$ (equivalently, ascent on $V(G, D)$), using both true manifold pairs and generated pairs.
    • With $D$ fixed, update $G$ using REINFORCE for discrete sampling:

    \nabla_\theta \mathbb{E}_{x \sim p_\theta}[\log(1 - D(x \mid q))] \approx \frac{1}{m} \sum_{k=1}^m \nabla_\theta \log p_\theta(x_k \mid q) \cdot \log(1 + e^{f_\phi(x_k, q)}).

  • The term $\log(1 + e^{f_\phi(\cdot)})$ acts as a reward signal for generator updates (see the policy-gradient sketch after this list).
  • Learning rates are annealed by a factor of 10 every two epochs.
  3. After convergence, the generative model $G$ is discarded; the discriminative model $D$’s hashing layers are retained for code generation.
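
The generator update can be sketched as a single policy-gradient (REINFORCE) step as follows; the optimizer, sample count, and use of softplus for $\log(1 + e^{f})$ are implementation assumptions:

```python
import torch
import torch.nn.functional as F

def generator_reinforce_step(h_query, h_candidates, f_scores, optimizer, num_samples=8):
    """One REINFORCE update for G. Sampling from p_theta is discrete, so the gradient is
    estimated as (1/m) * sum_k grad log p_theta(x_k|q) * reward(x_k) with
    reward(x_k) = log(1 + exp(f_phi(x_k, q))). Descending the surrogate loss below moves
    theta in that direction, i.e. it maximises the expected reward."""
    sq_dists = ((h_candidates - h_query.unsqueeze(0)) ** 2).sum(dim=1)
    log_probs = torch.log_softmax(-sq_dists, dim=0)          # log p_theta(x'|q)
    with torch.no_grad():
        probs = torch.softmax(-sq_dists, dim=0)
        idx = torch.multinomial(probs, num_samples, replacement=True)  # discrete samples x_k
        reward = F.softplus(f_scores[idx])                   # log(1 + e^{f_phi}), no gradient
    surrogate = -(log_probs[idx] * reward).mean()
    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()
```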

6. Experimental Protocol and Empirical Results

UGACH was evaluated on NUS-WIDE (186,557 image–tag pairs, 10 major concepts, 1% query holdout) and MIRFlickr (25,000 pairs, 5% holdout). Image features used 4,096-D VGG19; text used 1,000-D BoW. Competitors included state-of-the-art unsupervised methods (CVH, PDH, CMFH, CCQ), supervised baselines (CMSSH, SCM), and ablations (triplet only; GAN without graph).

Standard retrieval metrics were employed: mean average precision (MAP) at multiple code lengths ($\{16, 32, 64, 128\}$ bits), precision-recall curves, and Precision@K (128 bits).
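
For reference, MAP can be computed as in the sketch below; this is the standard definition and not a claim about the paper's exact evaluation script (e.g., whether a ranking cutoff is applied):

```python
import numpy as np

def average_precision(relevance):
    """AP for one query: 'relevance' is a binary vector over the ranked retrieval list."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / (np.arange(len(relevance)) + 1.0)
    return float((precision_at_k * relevance).sum() / relevance.sum())

def mean_average_precision(relevance_lists):
    """MAP: mean of per-query average precision values."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))
```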

Key results for NUS-WIDE:

  • UGACH: MAP(image→text) ≈ 0.624, MAP(text→image) ≈ 0.625
  • Best prior unsupervised: CCQ (0.505 / 0.494); ΔMAP +0.119 / +0.131
  • Best supervised: SCM_seq (0.517 / 0.516); ΔMAP +0.107 / +0.109
  • Ablations indicated progressive gains: GAN adds +0.04, graph adds +0.02 MAP.

Key results for MIRFlickr:

  • UGACH: MAP(image→text) ≈ 0.696, MAP(text→image) ≈ 0.681
  • Prior unsupervised: CMFH/CCQ (≈0.663 / 0.639); gains +0.033 / +0.058
  • Outperformed all supervised baselines on most metrics.

Qualitative analyses showed UGACH consistently producing higher Precision@K and improved PR curves in both cross-modal retrieval directions (Zhang et al., 2017).

7. Conceptual Synthesis and Significance

UGACH integrates a graph-based manifold prior (via k-NN adjacency) with a cross-modal generative adversarial network, where the generator selects “hard” negative examples respecting the empirical manifold structure and the discriminator employs triplet ranking to enforce discriminability in the learned Hamming space. The approach is entirely unsupervised and does not require label annotations. Experimental evidence demonstrates significant MAP improvements over both unsupervised and supervised alternatives on standard benchmarks, with consistent superiority in quantitative and qualitative retrieval measures (Zhang et al., 2017). This suggests robust potential for unsupervised cross-modal retrieval settings where annotation is impractical.

References

  1. Zhang et al. (2017). Unsupervised Generative Adversarial Cross-modal Hashing (UGACH).
