SPDQ: Shared Predictive Deep Quantization

Updated 29 January 2026
  • SPDQ is a supervised deep quantization method that decomposes features into shared and modality-specific subspaces for cross-modal similarity search.
  • It uses RKHS-based Maximum Mean Discrepancy to align shared representations, ensuring semantic consistency across image and text modalities.
  • Joint optimization with label-aligned additive quantization significantly reduces quantization error, improving retrieval performance and MAP scores.

Shared Predictive Deep Quantization (SPDQ) is a supervised deep quantization approach designed for cross-modal similarity search, particularly for retrieving semantically related items across heterogeneous modalities such as images and text. SPDQ introduces a dual-stream neural network architecture that explicitly decomposes feature representations into shared and modality-private subspaces. These latent representations are embedded into a reproducing-kernel Hilbert space and regularized for cross-modal alignment through Maximum Mean Discrepancy (MMD). In the shared subspace, SPDQ employs label-aligned additive quantization to produce compact, semantics-preserving codes suitable for efficient large-scale retrieval. Joint optimization integrates representation learning and supervised quantizer training, yielding significantly reduced quantization error and enhanced retrieval performance relative to prior state-of-the-art methods (Yang et al., 2019).

1. Dual-Stream Network Architecture: Shared and Private Subspaces

SPDQ operates on paired cross-modal datasets consisting of $N$ image–text pairs drawn from $K$ semantic classes. Each modality is processed by a distinct, yet architecturally parallel, deep network branch:

  • The image branch (e.g., AlexNet up to the fully connected layers) bifurcates into:
    • Shared-space unit: a fully connected layer with $\tanh$ activation yielding $s_i \in \mathbb{R}^{d_s}$.
    • Image-private unit: a fully connected layer with $\tanh$ activation yielding $r_i \in \mathbb{R}^{d_r}$.
  • The text branch (a three-layer MLP) mirrors this structure:
    • Shared-space unit: output $s_t \in \mathbb{R}^{d_s}$.
    • Text-private unit: output $r_t \in \mathbb{R}^{d_r}$.

For each data pair, embeddings $(s_i, r_i)$ and $(s_t, r_t)$ are obtained for the image and text, respectively, leading to global representations $S_i \in \mathbb{R}^{d_s \times N}$, $R_i \in \mathbb{R}^{d_r \times N}$ (images) and $S_t$, $R_t$ (texts). This explicit decomposition enables SPDQ to model modality-invariant semantics in the shared space while capturing modality-specific information in the private spaces.
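To make the decomposition concrete, here is a minimal NumPy sketch of one forward pass through the two heads of the image branch. The dimensions, weight initialization, and backbone features are illustrative assumptions, not the paper's settings; in SPDQ the backbone would be AlexNet features rather than random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def tanh_layer(x, W, b):
    """Fully connected layer with tanh activation."""
    return np.tanh(W @ x + b)

# Hypothetical dimensions: 256-d backbone features, d_s = 64 shared, d_r = 32 private.
d_in, d_s, d_r = 256, 64, 32
W_s_img, b_s_img = rng.standard_normal((d_s, d_in)) * 0.01, np.zeros(d_s)
W_r_img, b_r_img = rng.standard_normal((d_r, d_in)) * 0.01, np.zeros(d_r)

# Stand-in for the backbone output of one image (AlexNet fc features in the paper).
x_img = rng.standard_normal(d_in)

s_i = tanh_layer(x_img, W_s_img, b_s_img)   # shared-space embedding in R^{d_s}
r_i = tanh_layer(x_img, W_r_img, b_r_img)   # image-private embedding in R^{d_r}
```

The text branch follows the same pattern with its own weights, producing $s_t$ and $r_t$ from MLP features; stacking the per-sample outputs column-wise yields $S_i$, $R_i$, $S_t$, $R_t$.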

2. RKHS Embedding and Cross-Modal Alignment with Maximum Mean Discrepancy

To enforce semantic alignment of the shared representations and separation of the private representations, SPDQ employs the Maximum Mean Discrepancy (MMD) criterion within a reproducing-kernel Hilbert space (RKHS). Using a composite multi-kernel $k = \sum_u \beta_u k_u$ (typically a convex combination of Gaussians), the squared MMD between distributions $p$ and $q$ is:

$$d_k^2(p, q) = \left\| \mathbb{E}_{x \sim p}[\phi(x)] - \mathbb{E}_{y \sim q}[\phi(y)] \right\|_{\mathcal{H}_k}^2.$$

For each class $c$, the loss terms encourage the means of the shared representations $s_i^{(c)}$ and $s_t^{(c)}$ to coincide (semantic invariance), while maximizing the discrepancy between the private representations $r_i^{(c)}$ and $r_t^{(c)}$ (modality separation):

$$\mathcal{L}_1 = \sum_{c=1}^K \Bigl[ \|\mathbb{E}[\phi(s_i^{(c)})] - \mathbb{E}[\phi(s_t^{(c)})] \|_{\mathcal{H}_k}^2 - \| \mathbb{E}[\phi(r_i^{(c)})] - \mathbb{E}[\phi(r_t^{(c)})] \|_{\mathcal{H}_k}^2 \Bigr].$$
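The multi-kernel MMD term can be estimated from finite samples. The sketch below is a simple biased (V-statistic) estimator with an assumed set of Gaussian bandwidths and uniform kernel weights, not SPDQ's exact kernel configuration:

```python
import numpy as np

def mk_mmd2(X, Y, sigmas=(1.0, 2.0, 4.0), betas=None):
    """Biased estimator of squared MMD under a convex combination of Gaussian
    kernels. X: (n, d) samples from p; Y: (m, d) samples from q."""
    if betas is None:
        betas = np.ones(len(sigmas)) / len(sigmas)

    def k(A, B):
        # Pairwise squared Euclidean distances, then the multi-kernel sum
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sum(b * np.exp(-d2 / (2 * s ** 2)) for b, s in zip(betas, sigmas))

    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))       # stand-in for shared image features of a class
Y = rng.standard_normal((200, 8))       # matched text features: same distribution
Z = rng.standard_normal((200, 8)) + 3.0 # a shifted distribution
```

Identically distributed samples give an estimate near zero, while shifted samples give a clearly positive value; $\mathcal{L}_1$ minimizes this quantity over the per-class shared features and maximizes it over the private ones.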

Discriminability is further enforced by concatenating $h_i = [s_i; r_i]$ and $h_t = [s_t; r_t]$ and applying a $K$-way sigmoid cross-entropy loss over the labels $L$, yielding

$$\mathcal{L}_2 = \mathcal{L}_c(h_i, L) + \mathcal{L}_c(h_t, L).$$
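A minimal sketch of this classification term follows; the classifier weights, dimensions, and label vector are hypothetical. Sigmoid (rather than softmax) cross-entropy treats each of the $K$ classes as an independent binary decision, which also accommodates multi-label data:

```python
import numpy as np

def sigmoid_xent(h, W, y):
    """K-way sigmoid cross-entropy over class scores W @ h for the
    concatenated embedding h = [s; r] and a 0/1 label vector y."""
    logits = W @ h                        # (K,) per-class scores
    p = 1.0 / (1.0 + np.exp(-logits))     # independent class probabilities
    eps = 1e-12                           # numerical guard for log
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
K, d = 4, 96                              # classes; dim of h (d_s + d_r), illustrative
W = rng.standard_normal((K, d)) * 0.1
h = rng.standard_normal(d)
y = np.array([1, 0, 0, 1], dtype=float)   # multi-label ground truth
loss = sigmoid_xent(h, W, y)
```

$\mathcal{L}_2$ sums this loss over the image embedding $h_i$ and the text embedding $h_t$ of each pair.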

3. Supervised Additive Quantization with Label Alignment

After extracting shared representations, SPDQ introduces a quantization module with class-based label alignment:

  • Projection matrices: $C_i, C_t \in \mathbb{R}^{d_\ell \times d_s}$ with orthonormal rows ($C_i C_i^\top = I$, $C_t C_t^\top = I$).
  • Class prototype codebook: $Z \in \mathbb{R}^{d_\ell \times K}$.
  • Label-aligned representation: $ZL \in \mathbb{R}^{d_\ell \times N}$, where $L \in \{0,1\}^{K \times N}$ is the label matrix.

Alignment loss enforces congruence between projected shared features and their class prototypes:

$$\mathcal{L}_3 = \| C_i S_i - ZL \|_F^2 + \| C_t S_t - ZL \|_F^2, \quad \text{subject to } C_i C_i^\top = I, \; C_t C_t^\top = I.$$

Additive quantization is then applied to $ZL$ via $M$ sub-dictionaries $D = [D_1, \dots, D_M]$, each of dimension $d_\ell \times K$, with associated code assignments $B$. The quantization loss is

$$\mathcal{L}_4 = \|ZL - DB \|_F^2.$$

The label alignment and quantization losses are jointly weighted:

$$\mathcal{O}_q = \mathcal{L}_3 + \beta \mathcal{L}_4,$$

with hyperparameter $\beta > 0$.
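The quantization objective can be evaluated directly from the matrices above. The sketch below uses small illustrative dimensions and random data (not learned quantities) purely to show how the pieces fit together:

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_l, K, N, M = 16, 8, 4, 50, 2     # feature dim, code dim, classes, points, sub-dicts

S_i = rng.standard_normal((d_s, N))                        # shared image features
C_i = np.linalg.qr(rng.standard_normal((d_s, d_l)))[0].T   # orthonormal rows: C_i C_i^T = I
Z = rng.standard_normal((d_l, K))                          # class prototype codebook
L = np.eye(K)[:, rng.integers(0, K, N)]                    # one-hot label matrix, (K, N)

D = rng.standard_normal((d_l, M * K))   # M sub-dictionaries stacked side by side
B = np.zeros((M * K, N))                # one one-hot block per sub-dictionary
for m in range(M):
    B[m * K + rng.integers(0, K, N), np.arange(N)] = 1

align_loss = np.linalg.norm(C_i @ S_i - Z @ L, 'fro') ** 2  # L3 (image term only here)
quant_loss = np.linalg.norm(Z @ L - D @ B, 'fro') ** 2      # L4
beta = 0.5
O_q = align_loss + beta * quant_loss
```

In SPDQ both modality terms of $\mathcal{L}_3$ appear, and $C_i$, $C_t$, $Z$, $D$, $B$ are optimized rather than sampled.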

4. End-to-End Objective and Optimization

The combined objective function is

$$\mathcal{O} = \mathcal{L}_1 + \alpha\mathcal{L}_2 + \lambda(\mathcal{L}_3 + \beta\mathcal{L}_4), \quad \text{subject to } C_i C_i^\top = I, \; C_t C_t^\top = I,$$

with $\alpha$ controlling the strength of classification guidance and $\lambda$ balancing representation learning against quantization.

Optimization alternates between two blocks:

  • A. Deep network update: Parameters are updated via mini-batch SGD, backpropagating through the RKHS-MMD alignment, classification, and quantizer alignment losses.
  • B. Quantization module update: Given frozen network outputs $(S_i, S_t)$, solve the subproblems:

    1. $C_i$, $C_t$ via orthogonal Procrustes (using SVD).
    2. $Z$ in closed form.
    3. $D$ via least squares.
    4. $B$ by Iterated Conditional Modes (ICM), with exhaustive enumeration over sub-codes.

This iterative scheme continues until convergence.
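The two non-trivial subproblem updates, the Procrustes projection (steps 1–2) and the ICM code search (step 4), can be sketched as follows. This is a simplified stand-alone illustration on random data with a square projection and per-sub-dictionary code arrays, not the paper's full solver:

```python
import numpy as np

def procrustes(S, T):
    """argmin_C ||C S - T||_F subject to C C^T = I, via SVD (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(S @ T.T, full_matrices=False)
    return Vt.T @ U.T

def icm_codes(T, D_list, codes, sweeps=3):
    """Iterated Conditional Modes: cyclically re-pick each sub-code by
    exhaustive search over the K codewords of one sub-dictionary at a time."""
    M = len(D_list)
    for _ in range(sweeps):
        for m in range(M):
            # Residual target with sub-dictionary m's contribution removed
            R = T - sum(D_list[j][:, codes[j]] for j in range(M) if j != m)
            # Squared distance of each residual column to every codeword of D_m
            d2 = ((R[:, :, None] - D_list[m][:, None, :]) ** 2).sum(axis=0)
            codes[m] = d2.argmin(axis=1)
    return codes

rng = np.random.default_rng(0)
d, N, M, K = 6, 40, 2, 8
S = rng.standard_normal((d, N))           # stand-in for shared features
T = rng.standard_normal((d, N))           # stand-in for the target ZL

C = procrustes(S, T)                      # projection update
D_list = [rng.standard_normal((d, K)) for _ in range(M)]
codes = rng.integers(0, K, size=(M, N))   # random initial assignments
err0 = np.linalg.norm(T - sum(D_list[m][:, codes[m]] for m in range(M))) ** 2
codes = icm_codes(T, D_list, codes)       # code update
err1 = np.linalg.norm(T - sum(D_list[m][:, codes[m]] for m in range(M))) ** 2
```

Each ICM block update picks the per-column minimizer with the other sub-codes held fixed, so the quantization error never increases across sweeps.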

Training Pipeline Pseudocode

Input:  X_i, X_t, L, code length M·log₂K, α, β, λ, batch size B.
Initialize: network weights, C_i, C_t, Z, D, B.

repeat
    // Deep network update (A)
    Sample minibatch of B image–text–label triples
    Forward: CNN/MLP → {s_i, s_t, r_i, r_t}
    Compute losses: ℒ₁, ℒ₂, ℒ₃, ℒ₄
    𝒪_batch = ℒ₁ + αℒ₂ + λ(ℒ₃ + βℒ₄)
    Backpropagate, SGD-step on weights

    // Quantizer update (B)
    Compute S_i, S_t on all N points
    1) C_i ← OrthogonalProcrustes(S_i, ZL)
    2) C_t ← OrthogonalProcrustes(S_t, ZL)
    3) Z  ← [(C_i S_i + C_t S_t)+β D B] Lᵀ [(2+β)(L Lᵀ)]⁻¹
    4) D  ← (Z L Bᵀ)(B Bᵀ)⁻¹
    5) For n=1…N, update b_n by ICM over M sub-dictionaries
until convergence

Output: final network, C_i, C_t, Z, D, B.

At query time, an image (or text) is embedded and projected; retrieval relies on efficient Asymmetric Quantizer Distance (AQD) scoring in the shared subspace.

5. Retrieval Process

During inference, only the shared subspace is used. For a query image $x_i^{(q)}$:

  1. Pass the query through the CNN to obtain $s_i^{(q)}$.
  2. Project: $C_i s_i^{(q)} \in \mathbb{R}^{d_\ell}$.
  3. Retrieve the nearest text codes, encoded as $D b_d$, by AQD:

$$AQD(x_i^{(q)}, x_t^{(d)}) = (C_i s_i^{(q)})^\top (D b_d).$$

Scoring is accelerated via precomputed lookup tables of inner products for fast large-scale search.
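A minimal sketch of this lookup-table trick follows; the dimensions and random codebooks are illustrative. One inner-product table per sub-dictionary is computed once per query, after which scoring each database item costs only $M$ table lookups and additions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, K, M, N = 8, 16, 4, 1000

q = rng.standard_normal(d_l)                    # projected query, C_i s_i^(q)
D_list = [rng.standard_normal((d_l, K)) for _ in range(M)]
codes = rng.integers(0, K, size=(M, N))         # database sub-codes b_n

# Precompute: table[m, k] = <q, k-th codeword of sub-dictionary m>
tables = np.stack([q @ Dm for Dm in D_list])    # (M, K)

# AQD score for every database item via lookups, no per-item dot products
scores = tables[np.arange(M)[:, None], codes].sum(axis=0)   # (N,)

# Sanity check against the explicit inner product for one item
n = 0
x_n = sum(D_list[m][:, codes[m, n]] for m in range(M))      # reconstructed D b_n
```

The top-scoring items under `scores` are returned as the retrieval result.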

6. Empirical Performance and Comparative Analysis

Experiments on FLICKR25K and NUS-WIDE demonstrate that SPDQ surpasses previous deep quantization methods, including CDQ, by 3–5 mean average precision (MAP) points, with pronounced gains in low-bit regimes (e.g., 16 and 32 bits). Top-$N$ precision is also significantly improved for small $N$, indicating effective alignment and quantization.

The following table summarizes principal innovations and their empirical consequences:

| SPDQ Design Feature | Mechanism | Empirical Outcome |
|---|---|---|
| Shared/private decomposition | Dual subspace network heads | Tighter inter-modal alignment |
| MK-MMD loss | RKHS distribution matching; modality separation | Lower retrieval error |
| Label alignment | Collapse classes to prototypes before quantization | Reduced intra-class variance |
| Supervised additive quantization | Joint codebook learning in shared subspace | Improved MAP and precision |

SPDQ’s explicit RKHS-based subspace modeling, strong label alignment, and integrated quantizer learning collectively underpin both its theoretical robustness and leading empirical results (Yang et al., 2019).

7. Methodological Significance

Shared Predictive Deep Quantization exemplifies a cohesive integration of several advanced concepts in cross-modal retrieval:

  • Subspace decomposition enhances semantic fidelity across modalities by explicitly modeling shared and modality-specific factors.
  • RKHS-based MMD alignment is more flexible than Euclidean similarity constraints, as it matches entire distributions within each semantic class.
  • Label alignment ensures robust representation collapse, directly optimizing for retrieval by minimizing intra-class spread before quantization.
  • End-to-end learning ensures that feature extraction and quantizer adaptation are synergistically optimized, contrasting favorably with two-stage or decoupled frameworks.

A plausible implication is that deep quantization approaches employing multi-granular alignment (distributional, label-aligned, and quantization error-aware) are likely to exhibit superior performance in large-scale, heterogeneous cross-modal retrieval settings.
