SPDQ: Shared Predictive Deep Quantization
- SPDQ is a supervised deep quantization method that decomposes features into shared and modality-specific subspaces for cross-modal similarity search.
- It uses RKHS-based Maximum Mean Discrepancy to align shared representations, ensuring semantic consistency across image and text modalities.
- Joint optimization with label-aligned additive quantization significantly reduces quantization error, improving retrieval performance and MAP scores.
Shared Predictive Deep Quantization (SPDQ) is a supervised deep quantization approach designed for cross-modal similarity search, particularly for retrieving semantically related items across heterogeneous modalities such as images and text. SPDQ introduces a dual-stream neural network architecture that explicitly decomposes feature representations into shared and modality-private subspaces. These latent representations are embedded into a reproducing-kernel Hilbert space and regularized for cross-modal alignment through Maximum Mean Discrepancy (MMD). In the shared subspace, SPDQ employs label-aligned additive quantization to produce compact, semantics-preserving codes suitable for efficient large-scale retrieval. Joint optimization integrates representation learning and supervised quantizer training, yielding significantly reduced quantization error and enhanced retrieval performance relative to prior state-of-the-art methods (Yang et al., 2019).
1. Dual-Stream Network Architecture: Shared and Private Subspaces
SPDQ operates on paired cross-modal datasets consisting of N image–text pairs annotated with C semantic classes. Each modality is processed by a distinct yet architecturally parallel deep network branch:
- The image branch (e.g., AlexNet up to the fully connected layers) bifurcates into:
- Shared-space unit: a fully connected layer with nonlinear activation yielding the shared representation s_i.
- Image-private unit: a fully connected layer with nonlinear activation yielding the private representation r_i.
- The text branch (a three-layer MLP) mirrors this structure:
- Shared-space unit: output s_t.
- Text-private unit: output r_t.
For each data pair, embeddings s_i, r_i are obtained for the image and s_t, r_t for the text; stacking them over the training set gives global representation matrices S_i, R_i (images) and S_t, R_t (texts). This explicit decomposition enables SPDQ to model modality-invariant semantics in the shared space while capturing modality-specific information in the private spaces.
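The bifurcation into shared and private heads can be sketched with plain numpy. The layer sizes, tanh activations, and random weights below are illustrative stand-ins, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: backbone feature sizes and subspace sizes.
d_img, d_txt, d_shared, d_private = 4096, 1386, 128, 64

def fc(x, W, b):
    """Fully connected layer with tanh activation."""
    return np.tanh(x @ W + b)

# Image branch heads: one shared-space unit, one image-private unit.
W_is, b_is = rng.standard_normal((d_img, d_shared)) * 0.01, np.zeros(d_shared)
W_ip, b_ip = rng.standard_normal((d_img, d_private)) * 0.01, np.zeros(d_private)

# Text branch heads mirror the structure on the text feature dimension.
W_ts, b_ts = rng.standard_normal((d_txt, d_shared)) * 0.01, np.zeros(d_shared)
W_tp, b_tp = rng.standard_normal((d_txt, d_private)) * 0.01, np.zeros(d_private)

# One image-text pair (stand-ins for backbone outputs).
x_img = rng.standard_normal((1, d_img))
x_txt = rng.standard_normal((1, d_txt))

s_i, r_i = fc(x_img, W_is, b_is), fc(x_img, W_ip, b_ip)  # image: shared, private
s_t, r_t = fc(x_txt, W_ts, b_ts), fc(x_txt, W_tp, b_tp)  # text: shared, private
```

The shared outputs s_i and s_t land in a common 128-dimensional space and can be compared directly, while the private outputs live in modality-specific spaces.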
2. RKHS Embedding and Cross-Modal Alignment with Maximum Mean Discrepancy
To enforce semantic alignment of the shared representations and separation of the private representations, SPDQ employs the Maximum Mean Discrepancy (MMD) criterion within a reproducing-kernel Hilbert space (RKHS). Using a composite multi-kernel k (typically a convex combination of Gaussian kernels) with RKHS feature map φ, the squared MMD between distributions p and q is

MMD²(p, q) = ‖𝔼_{x∼p}[φ(x)] − 𝔼_{y∼q}[φ(y)]‖²_ℋ
For each class c, the loss ℒ₁ encourages the class-conditional distributions of the shared representations s_i and s_t to coincide (semantic invariance), while maximizing the discrepancy between the private representations r_i and r_t (modality separation):

ℒ₁ = Σ_c [ MMD²(s_i | c, s_t | c) − MMD²(r_i | c, r_t | c) ]
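An empirical (biased) MMD estimator with a sum-of-Gaussians kernel illustrates the criterion; the bandwidths below are arbitrary stand-ins for the paper's multi-kernel weights:

```python
import numpy as np

def mmd2(X, Y, gammas=(0.5, 1.0, 2.0)):
    """Biased squared-MMD estimate between samples X and Y (rows = points),
    using a sum of Gaussian kernels with the given bandwidth parameters."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sum(np.exp(-g * d2) for g in gammas)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. a mean-shifted one.
same  = mmd2(rng.normal(0, 1, (200, 8)), rng.normal(0, 1, (200, 8)))
shift = mmd2(rng.normal(0, 1, (200, 8)), rng.normal(3, 1, (200, 8)))
```

Minimizing this quantity per class pulls the shared image and text distributions together; the negated term pushes the private distributions apart.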
Discriminability is further enforced by concatenating each sample's shared and private representations and applying a C-way sigmoid cross-entropy loss over the labels L, yielding the classification loss ℒ₂.
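The multi-label sigmoid cross-entropy itself is standard; a numerically stable numpy version (independent of SPDQ's exact classifier head) is:

```python
import numpy as np

def sigmoid_xent(logits, labels):
    """Stable element-wise sigmoid cross-entropy, averaged over all entries.
    Uses max(x,0) - x*y + log(1 + exp(-|x|)) to avoid overflow."""
    return np.mean(np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))

labels = np.array([1.0, 0.0, 0.0])
aligned = sigmoid_xent(np.array([4.0, -4.0, -4.0]), labels)  # correct logits
flipped = sigmoid_xent(np.array([-4.0, 4.0, 4.0]), labels)   # inverted logits
```

Logits that agree with the label vector incur a far smaller loss than inverted ones, which is what drives discriminability in the concatenated representation.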
3. Supervised Additive Quantization with Label Alignment
After extracting shared representations, SPDQ introduces a quantization module with class-based label alignment:
- Projection matrices: C_i, C_t (orthonormal: C_iᵀC_i = C_tᵀC_t = I).
- Class prototype codebook: Z, with one prototype column per class.
- Label-aligned representation: ZL, which maps each sample to the prototype of its class through the one-hot label matrix L.
Alignment loss ℒ₃ enforces congruence between projected shared features and their class prototypes:

ℒ₃ = ‖C_i S_i − ZL‖²_F + ‖C_t S_t − ZL‖²_F
Additive quantization is applied over ZL via M sub-dictionaries D = [D₁, …, D_M], each holding K codewords, with one-hot code assignments stacked into B. The quantization loss is

ℒ₄ = ‖ZL − DB‖²_F
The label-alignment and quantization losses are jointly weighted as ℒ₃ + βℒ₄, with hyperparameter β > 0 balancing alignment against quantization fidelity.
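Under these definitions both losses are straightforward to evaluate. A minimal numpy sketch with random stand-in values for C_i, C_t, Z, L, D, B (all dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d, C, N, M, K, beta = 32, 10, 64, 4, 8, 0.5  # hypothetical sizes

C_i = np.linalg.qr(rng.standard_normal((d, d)))[0]  # orthogonal projections
C_t = np.linalg.qr(rng.standard_normal((d, d)))[0]
Z = rng.standard_normal((d, C))                     # one prototype per class
L = np.eye(C)[rng.integers(0, C, N)].T              # one-hot label matrix, C x N
S_i = rng.standard_normal((d, N))                   # shared features (columns)
S_t = rng.standard_normal((d, N))

# Additive-quantization dictionary D = [D_1 ... D_M] and one-hot subcodes B:
# each sample picks exactly one of K codewords from each sub-dictionary.
D = rng.standard_normal((d, M * K))
B = np.zeros((M * K, N))
for m in range(M):
    B[m * K + rng.integers(0, K, N), np.arange(N)] = 1.0

ZL = Z @ L                                           # label-aligned targets
loss3 = np.linalg.norm(C_i @ S_i - ZL, 'fro') ** 2 + \
        np.linalg.norm(C_t @ S_t - ZL, 'fro') ** 2   # alignment loss L3
loss4 = np.linalg.norm(ZL - D @ B, 'fro') ** 2       # quantization loss L4
quant_obj = loss3 + beta * loss4
```

Note that each column of B carries exactly M ones, one per sub-dictionary, so a code occupies only M·log₂K bits.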
4. End-to-End Objective and Optimization
The combined objective function is

𝒪 = ℒ₁ + αℒ₂ + λ(ℒ₃ + βℒ₄)

with α controlling classification guidance and λ balancing representation learning against quantization.
Optimization alternates between two blocks:
- A. Deep network update: Parameters are updated via mini-batch SGD, backpropagating through the RKHS-MMD alignment, classification, and quantizer alignment losses.
- B. Quantization module update: Given frozen network outputs S_i, S_t, solve the subproblems:
- C_i, C_t through orthogonal Procrustes (using SVD).
- Z in closed form: Z = [(C_i S_i + C_t S_t) + βDB] Lᵀ [(2+β)(LLᵀ)]⁻¹.
- D via least squares: D = (ZLBᵀ)(BBᵀ)⁻¹.
- B by Iterated Conditional Modes (ICM), exhaustively enumerating the K candidates for each of the M subcodes in turn.
This iterative scheme continues until convergence.
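One full quantizer-update pass (steps 1–5 of the pseudocode below) can be sketched in numpy. The Procrustes and ICM routines are generic implementations of the named techniques; all sizes and initial values are illustrative:

```python
import numpy as np

def procrustes(S, T):
    """min_C ||C S - T||_F subject to C orthogonal, via SVD of T S^T."""
    U, _, Vt = np.linalg.svd(T @ S.T)
    return U @ Vt

def icm_encode(ZL, D, B, M, K, sweeps=3):
    """Iterated Conditional Modes: cycle over sub-dictionaries, exhaustively
    re-picking the best of K codewords for every point."""
    N = ZL.shape[1]
    for _ in range(sweeps):
        for m in range(M):
            Dm = D[:, m * K:(m + 1) * K]
            # Residual with sub-dictionary m's current contribution removed.
            R = ZL - D @ B + Dm @ B[m * K:(m + 1) * K, :]
            # Squared error of each codeword vs. each residual
            # (the constant ||R_n||^2 per point is dropped).
            dist = (Dm ** 2).sum(0)[:, None] - 2 * Dm.T @ R
            best = dist.argmin(axis=0)
            B[m * K:(m + 1) * K, :] = 0.0
            B[m * K + best, np.arange(N)] = 1.0
    return B

rng = np.random.default_rng(0)
d, C, N, M, K, beta = 16, 5, 40, 2, 8, 0.5
S_i, S_t = rng.standard_normal((d, N)), rng.standard_normal((d, N))
L = np.eye(C)[np.arange(N) % C].T        # one-hot labels; every class occupied
Z = rng.standard_normal((d, C))
D = rng.standard_normal((d, M * K))
B = np.zeros((M * K, N))
for m in range(M):
    B[m * K + rng.integers(0, K, N), np.arange(N)] = 1.0

C_i = procrustes(S_i, Z @ L)                              # steps 1-2
C_t = procrustes(S_t, Z @ L)
Z = ((C_i @ S_i + C_t @ S_t) + beta * D @ B) @ L.T @ \
    np.linalg.inv((2 + beta) * (L @ L.T))                 # step 3, closed form
ZL = Z @ L
D = (ZL @ B.T) @ np.linalg.pinv(B @ B.T)                  # step 4, least squares
mid = np.linalg.norm(ZL - D @ B) ** 2
B = icm_encode(ZL, D, B, M, K)                            # step 5
after = np.linalg.norm(ZL - D @ B) ** 2
```

Each block solves its subproblem exactly given the others, so ICM can never increase ℒ₄ for the fixed dictionary.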
Training Pipeline Pseudocode
```text
Input: X_i, X_t, L, code length M·log₂K, α, β, λ, batch size B.
Initialize: network weights, C_i, C_t, Z, D, B.
repeat
    // Deep network update (A)
    Sample minibatch of B image–text–label triples
    Forward: CNN/MLP → {s_i, s_t, r_i, r_t}
    Compute losses: ℒ₁, ℒ₂, ℒ₃, ℒ₄
    𝒪_batch = ℒ₁ + αℒ₂ + λ(ℒ₃ + βℒ₄)
    Backpropagate, SGD-step on weights
    // Quantizer update (B)
    Compute S_i, S_t on all N points
    1) C_i ← OrthogonalProcrustes(S_i, ZL)
    2) C_t ← OrthogonalProcrustes(S_t, ZL)
    3) Z ← [(C_i S_i + C_t S_t) + β D B] Lᵀ [(2+β)(L Lᵀ)]⁻¹
    4) D ← (Z L Bᵀ)(B Bᵀ)⁻¹
    5) For n = 1…N, update b_n by ICM over M sub-dictionaries
until convergence
Output: final network, C_i, C_t, Z, D, B.
```
At query time, an image (or text) is embedded and projected; retrieval relies on efficient Asymmetric Quantizer Distance (AQD) scoring in the shared subspace.
5. Retrieval Process
During inference, only the shared subspace is used. For a query image x_q, the network produces its shared embedding s_q, which is projected as C_i s_q (analogously with C_t for a text query). The similarity to a database item with code b_n is the Asymmetric Quantizer Distance

AQD(x_q, n) = (C_i s_q)ᵀ (D b_n) = Σ_{m=1}^{M} (C_i s_q)ᵀ D_m b_n^{(m)}

Scoring is accelerated via a precomputed M×K lookup table of inner products between the projected query and every codeword, so scoring each database item requires only M table lookups and additions for fast large-scale search.
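The lookup-table trick can be verified in a few lines of numpy; the rotation, dictionary, and codes below are random stand-ins for learned quantities:

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, K, N = 16, 4, 8, 100
C_i = np.linalg.qr(rng.standard_normal((d, d)))[0]  # learned rotation (stand-in)
D = rng.standard_normal((d, M * K))                  # sub-dictionaries side by side
codes = rng.integers(0, K, (M, N))                   # subcode index per item

s_q = rng.standard_normal(d)            # query's shared embedding
q = C_i @ s_q                           # projected query

# Precompute the M x K lookup table of inner products <q, codeword>.
lut = (q @ D).reshape(M, K)

# AQD score for every database item: M table lookups + adds each.
scores = lut[np.arange(M)[:, None], codes].sum(0)

# Equivalent dense computation for verification: reconstruct each item
# as the sum of its selected codewords and take the inner product.
x_hat = np.zeros((d, N))
for m in range(M):
    x_hat += D[:, m * K:(m + 1) * K][:, codes[m]]
```

The table-based scores match the dense inner products exactly, but cost O(M) per item instead of O(d) once the table is built.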
6. Empirical Performance and Comparative Analysis
Experiments on FLICKR25K and NUS-WIDE demonstrate that SPDQ surpasses previous deep quantization methods, including CDQ, by 3–5 mean average precision (MAP) points, with pronounced gains in low-bit regimes (e.g., 16 or 32 bits). Top-k precision is also significantly improved for small k, indicating effective alignment and quantization.
The following table summarizes principal innovations and their empirical consequences:
| SPDQ Design Feature | Mechanism | Empirical Outcome |
|---|---|---|
| Shared/Private Decomposition | Dual subspace network heads | Tighter inter-modal alignment |
| MK-MMD Loss | RKHS distribution matching; modality separation | Lower retrieval error |
| Label Alignment | Collapse classes to prototypes before quantization | Reduced intra-class variance |
| Supervised Additive Quantization | Joint codebook learning in shared subspace | Improved MAP, precision |
SPDQ’s explicit RKHS-based subspace modeling, strong label alignment, and integrated quantizer learning collectively underpin both its theoretical robustness and leading empirical results (Yang et al., 2019).
7. Methodological Significance
Shared Predictive Deep Quantization exemplifies a cohesive integration of several advanced concepts in cross-modal retrieval:
- Subspace decomposition enhances semantic fidelity across modalities by explicitly modeling shared and modality-specific factors.
- RKHS-based MMD alignment is more flexible than Euclidean similarity constraints, as it matches entire distributions within each semantic class.
- Label alignment ensures robust representation collapse, directly optimizing for retrieval by minimizing intra-class spread before quantization.
- End-to-end learning ensures that feature extraction and quantizer adaptation are synergistically optimized, contrasting favorably with two-stage or decoupled frameworks.
A plausible implication is that deep quantization approaches employing multi-granular alignment (distributional, label-aligned, and quantization error-aware) are likely to exhibit superior performance in large-scale, heterogeneous cross-modal retrieval settings.