Deep Matching Autoencoders (DMAE)
- The paper introduces DMAE, a method that jointly optimizes modality-specific autoencoders and an explicit matching mechanism to align data without complete pairing.
- It integrates reconstruction loss with a dependence-driven matching term (using uKTA or SMI) to learn a common latent space for cross-modal correspondence.
- DMAE employs alternating optimization and a relaxed permutation matrix approach, yielding competitive performance in image-text retrieval and unsupervised classifier learning.
Deep Matching Autoencoders (DMAE) provide a principled framework for learning a common latent representation and inferring the correspondence between data in multiple modalities (or “views”) when paired annotations are limited or absent. By combining modality-specific autoencoders with an explicit pairing inference mechanism, DMAE addresses the challenge of cross-modal representation learning in the fully supervised, semi-supervised, and fully unsupervised regimes. The method’s objective integrates autoencoding reconstruction with a dependence-driven matching term, enabling multi-modal embedding and object matching even when no initial correspondences are known (Mukherjee et al., 2017).
1. Problem Formulation and Motivation
DMAE is designed for scenarios involving two data views, denoted $X$ and $Y$, where $X = \{x_i\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^{d_x}$ and $Y = \{y_j\}_{j=1}^{n}$ with $y_j \in \mathbb{R}^{d_y}$. Traditional cross-modal embedding approaches, such as CCA, DeepCCA, and two-branch ranking nets, require a set of explicitly paired data $\{(x_i, y_i)\}_{i=1}^{n}$. However, many practical applications feature only partially paired or entirely unpaired datasets, rendering standard supervised or correlation-based losses inapplicable. DMAE simultaneously optimizes (i) representation learning via deep autoencoders for each modality and (ii) inference of the unknown one-to-one matching between $X$ and $Y$ by maximizing a statistical dependence criterion in a shared latent space.
2. Network Architectures and Latent Spaces
For each modality, DMAE employs separate deep autoencoders:
- For view $X$, the encoder $f_x$ maps an input $x$ to a latent code $z_x = f_x(x) \in \mathbb{R}^{d}$ via a composition of several nonlinear layers, followed by a decoder $g_x$ expanding $z_x$ back to the original space.
- For view $Y$, $f_y$ and $g_y$ are defined analogously, with an identical latent space dimensionality $d$ to facilitate cross-view dependence estimation.
Encoders and decoders are modality-specific and do not share weights, but their aligned latent representations enable statistical dependence to be quantified across modalities.
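The two-branch layout described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' architecture: the single tanh layer, the weight initialization, the toy dimensions, and the helper names (`init_autoencoder`, `encode`, `decode`, `reconstruction_loss`) are all assumptions made for the example. The only property taken from the text is that each view has its own weights while both encoders emit codes of the same dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_autoencoder(d_in, d_latent, rng):
    """Weights for a one-layer encoder and one-layer decoder (illustrative layout)."""
    scale = 1.0 / np.sqrt(d_in)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_in, d_latent)),
        "b_enc": np.zeros(d_latent),
        "W_dec": rng.normal(0.0, scale, size=(d_latent, d_in)),
        "b_dec": np.zeros(d_in),
    }

def encode(params, x):
    # Nonlinear map from the input space to the latent space.
    return np.tanh(x @ params["W_enc"] + params["b_enc"])

def decode(params, z):
    # Linear map from the latent space back to the input space.
    return z @ params["W_dec"] + params["b_dec"]

def reconstruction_loss(params, x):
    # Mean squared reconstruction error over the batch.
    x_hat = decode(params, encode(params, x))
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

# Toy data: two views with different input sizes but a shared latent size.
n, d_x, d_y, d_latent = 8, 20, 12, 4
X = rng.normal(size=(n, d_x))
Y = rng.normal(size=(n, d_y))

ae_x = init_autoencoder(d_x, d_latent, rng)  # weights are not shared
ae_y = init_autoencoder(d_y, d_latent, rng)

Z_x = encode(ae_x, X)  # shape (n, d_latent)
Z_y = encode(ae_y, Y)  # shape (n, d_latent)
total_rec = reconstruction_loss(ae_x, X) + reconstruction_loss(ae_y, Y)
```

Because `Z_x` and `Z_y` live in spaces of equal dimension, kernel Gram matrices computed on them can be compared directly, which is what the dependence term relies on.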
3. Joint Reconstruction and Matching Objective
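DMAE fundamentally seeks a joint solution to both representation learning and pairing inference. Let $\Pi \in \{0,1\}^{n \times n}$ be a permutation matrix representing the unknown correspondence, so that $\Pi_{ij} = 1$ if $x_i$ is paired with $y_j$. The full optimization objective is
$$\min_{f_x,\, g_x,\, f_y,\, g_y,\, \Pi} \; \mathcal{L}_{\mathrm{rec}} - \lambda\, D_{\Pi}$$
with components:
- Reconstruction Loss:
$$\mathcal{L}_{\mathrm{rec}} = \sum_{i=1}^{n} \left\| x_i - g_x(f_x(x_i)) \right\|_2^2 + \sum_{j=1}^{n} \left\| y_j - g_y(f_y(y_j)) \right\|_2^2$$
- Matching/Dependence Term:
$$D_{\Pi} = D\big(\{(f_x(x_i),\, f_y(y_{\pi(i)}))\}_{i=1}^{n}\big)$$
with the assignment $\pi$ defined by $\Pi$: $\pi(i) = j$ iff $\Pi_{ij} = 1$.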
DMAE explores two forms for the dependence measure $D_{\Pi}$:
- Unnormalized Kernel Target Alignment (uKTA): computes dependence as $\mathrm{tr}(K \Pi L \Pi^\top)$, with $K$ and $L$ the Gram matrices obtained from kernel functions on the latent codes of the two views.
- Squared-loss Mutual Information (SMI): employs kernel density-ratio estimation, providing a flexible weighting via learned coefficients $\alpha$. SMI reduces to uKTA if all $\alpha_\ell$ are uniform.
The hyperparameter $\lambda$ modulates the trade-off between within-modality autoencoding and cross-modal matching.
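To make the uKTA variant concrete, the sketch below evaluates $\mathrm{tr}(K \Pi L \Pi^\top)$ on toy latent codes, where $K$ and $L$ are the Gram matrices of the two views and $\Pi_{ij} = 1$ means item $i$ of the first view is paired with item $j$ of the second. The Gaussian kernel, its bandwidth, and the function names are assumptions chosen for illustration. The second view is simply a shuffled copy of the first, so the correct pairing is known by construction.

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    """Gram matrix of a Gaussian kernel on the rows of Z (kernel choice is an assumption)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def ukta(K, L, P):
    """Unnormalized kernel target alignment tr(K P L P^T) under matching matrix P."""
    return float(np.trace(K @ P @ L @ P.T))

rng = np.random.default_rng(0)
n = 6
Z = rng.normal(size=(n, 3))

# Second view: the same codes in shuffled order, so y_j = z_{perm[j]}.
perm = rng.permutation(n)
K = gaussian_gram(Z)
L = gaussian_gram(Z[perm])

# True matching: x_i pairs with the y_j for which perm[j] == i.
P_true = np.zeros((n, n))
P_true[perm, np.arange(n)] = 1.0

score_true = ukta(K, L, P_true)
score_identity = ukta(K, L, np.eye(n))
```

In this toy setting `P_true @ L @ P_true.T` equals `K`, so `score_true` equals the squared Frobenius norm of `K`, which is the largest value any permutation can attain here. Real cross-modal data will not produce such an exact maximum.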
4. Permutation Matrix Estimation and Relaxation
The inference of $\Pi$ is a quadratic assignment problem, known to be NP-hard in the discrete case. DMAE relaxes $\Pi$ to allow continuous values in $[0, 1]$, enforcing (soft) row and column sum constraints to ensure near-permutation behavior:
$$\Pi \mathbf{1}_n = \mathbf{1}_n, \qquad \Pi^\top \mathbf{1}_n = \mathbf{1}_n, \qquad 0 \le \Pi_{ij} \le 1$$
This relaxation enables end-to-end differentiable optimization by gradient descent, with the final solution optionally projected back to a valid permutation. In semi-supervised and supervised regimes, entries of $\Pi$ for known pairs are fixed accordingly.
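The soft constraints and the optional final projection can be illustrated as follows. The quadratic penalty form and the greedy rounding rule are assumptions made for this sketch; the text only states that row and column sums are softly enforced and that the result may be projected to a valid permutation. Greedy rounding is a simple heuristic, not the optimal projection that a linear assignment solver would give.

```python
import numpy as np

def permutation_penalty(P):
    """Squared violation of the row-sum and column-sum constraints."""
    ones = np.ones(P.shape[0])
    row_gap = P @ ones - ones
    col_gap = P.T @ ones - ones
    return float(row_gap @ row_gap + col_gap @ col_gap)

def greedy_round(P):
    """Round a relaxed matrix to a hard permutation.

    Repeatedly picks the largest remaining entry and removes its row and
    column from consideration (a heuristic, not an optimal assignment).
    """
    n = P.shape[0]
    work = P.astype(float).copy()
    hard = np.zeros_like(work)
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(work), work.shape)
        hard[i, j] = 1.0
        work[i, :] = -np.inf
        work[:, j] = -np.inf
    return hard

# A soft matrix that leans toward a known permutation.
P_hard = np.eye(4)[[2, 0, 3, 1]]
P_soft = 0.7 * P_hard + 0.3 * np.full((4, 4), 0.25)
P_rounded = greedy_round(P_soft)
```

Note that the penalty is zero for any doubly stochastic matrix, not only for permutations, which is why a final rounding step is useful when a hard one-to-one matching is required.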
5. Optimization Strategy
DMAE employs an alternating optimization procedure:
A. Network Update: With the matching matrix $\Pi$ fixed, optimize $f_x$, $g_x$, $f_y$, $g_y$ by backpropagation on the joint objective.
B. Matching Update: With encoder/decoder parameters fixed, update $\Pi$ by gradient-based minimization of the negative dependence term plus permutation regularization, under box constraints $0 \le \Pi_{ij} \le 1$.
This alternation proceeds until convergence. Starting from a random initial $\Pi$, it is reported to bootstrap to high-quality matchings, as evidenced by the mean precision and recall of the estimated $\Pi$ increasing monotonically over iterations.
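Step B can be sketched as projected gradient descent on the relaxed matching matrix for the uKTA case. Everything specific here is an assumption for illustration: the penalty weight `rho`, the step size, the number of steps, and the use of clipping to enforce the box constraints. Step A is omitted because it is ordinary autoencoder training with the matching held fixed, which would normally be delegated to an autodiff framework.

```python
import numpy as np

def matching_objective(P, K, L, rho):
    """Negative uKTA plus a quadratic row/column-sum penalty."""
    ones = np.ones(P.shape[0])
    r = P @ ones - ones
    c = P.T @ ones - ones
    return -np.trace(K @ P @ L @ P.T) + rho * (r @ r + c @ c)

def matching_gradient(P, K, L, rho):
    """Gradient of matching_objective with respect to P (K and L symmetric)."""
    ones = np.ones(P.shape[0])
    r = P @ ones - ones
    c = P.T @ ones - ones
    grad_dependence = -2.0 * K @ P @ L
    grad_penalty = 2.0 * rho * (np.outer(r, ones) + np.outer(ones, c))
    return grad_dependence + grad_penalty

def matching_update(P, K, L, rho=1.0, step=1e-2, n_steps=50):
    """Projected gradient steps; clipping keeps every entry inside [0, 1]."""
    P = P.copy()
    for _ in range(n_steps):
        P = np.clip(P - step * matching_gradient(P, K, L, rho), 0.0, 1.0)
    return P
```

In a full training loop, `K` and `L` would be recomputed from the current latent codes after each network update, and the two steps would be repeated until the matching stabilizes.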
6. Unification of Supervision Regimes
DMAE’s formulation is agnostic to the amount of cross-modal pairing available:
- Supervised: All $n$ pairs known; $\Pi$ is fixed.
- Semi-supervised: Only $m < n$ pairs known; loss terms for those indices are imposed directly, while unknown segments of $\Pi$ are optimized.
- Unsupervised: No pairs known ($m = 0$). The entire matching structure is inferred.
This enables DMAE to gracefully interpolate between unsupervised object matching (e.g., Unsupervised Classifier Learning/UCL) and conventional cross-modal embedding.
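One way to express the three regimes in code is to initialize the matching matrix from whatever pairs are known and to record which entries must stay fixed during optimization. The helper below is hypothetical, and its uniform initialization of the unknown block is an assumption of this sketch (a random start is an equally valid choice).

```python
import numpy as np

def init_matching(n, known_pairs=()):
    """Build an initial matching matrix and a mask of fixed entries.

    known_pairs holds (i, j) tuples meaning x_i is paired with y_j.
    Rows and columns of known pairs are frozen; the remaining block
    starts uniform so that every row still sums to one.
    """
    known_pairs = list(known_pairs)
    known_rows = np.array([i for i, _ in known_pairs], dtype=int)
    known_cols = np.array([j for _, j in known_pairs], dtype=int)
    free_rows = np.setdiff1d(np.arange(n), known_rows)
    free_cols = np.setdiff1d(np.arange(n), known_cols)

    P = np.zeros((n, n))
    fixed = np.ones((n, n), dtype=bool)
    for i, j in known_pairs:
        P[i, j] = 1.0
    if free_rows.size:
        block = np.ix_(free_rows, free_cols)
        P[block] = 1.0 / free_rows.size
        fixed[block] = False
    return P, fixed

P_sup, fixed_sup = init_matching(3, [(0, 2), (1, 0), (2, 1)])  # supervised
P_semi, fixed_semi = init_matching(4, [(0, 1)])                # semi-supervised
P_unsup, fixed_unsup = init_matching(3)                        # unsupervised
```

A matching update can then zero the gradient wherever `fixed` is true, so the same optimization code serves all three regimes.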
7. Empirical Results and Ablative Analyses
Experimental evaluation demonstrates the effectiveness of DMAE across retrieval and unsupervised classifier learning tasks:
- Image-Sentence Retrieval (Flickr30K, MS-COCO):
  - Fully supervised: DMAE-SMI matches or slightly exceeds DeepCCA and two-branch net baselines, e.g., on MS-COCO image→text retrieval.
  - Semi-supervised (e.g., 40% paired + 60% unpaired): Outperforms Matching CCA and deep-MCCA by 5–10 points on the retrieval metric.
  - Fully unsupervised: Achieves nontrivial, above-chance retrieval on Flickr30K.
- Unsupervised Classifier Learning (AwA, CIFAR-10):
  - No image-label pairs are given; only a bag of class word-vectors is available. DMAE-SMI is evaluated by the average per-class precision and recall of its inferred matching on AwA.
  - SVMs trained on the inferred label mappings reach accuracy well above chance on both AwA and CIFAR-10.
  - Semi-supervised transductive variants, which combine a labeled subset, unlabeled data, and the test inputs seen during training, further boost accuracy on CIFAR-10.
Ablations establish the necessity of the reconstruction term (“Deep-SMI” without reconstruction is markedly less effective) and the consistent superiority of SMI over uKTA, attributed to SMI’s adaptive density-ratio weighting.
8. Context and Significance
DMAE introduces a mechanism for aligning multi-modal datasets when explicit pairings are missing, an area where classical deep metric learning and traditional CCA-based models are inapplicable. The method’s applicability extends to multi-view retrieval, unsupervised classifier induction, and generalizable cross-modal matching tasks, providing tractable optimization through relaxation and alternation strategies. The empirical evidence supports the framework’s capacity to discover meaningful correspondences and achieve competitive performance across varying regimes of supervision (Mukherjee et al., 2017).