Deep Matching Autoencoders (DMAE)

Updated 1 April 2026
  • The paper introduces DMAE, a method that jointly optimizes modality-specific autoencoders and an explicit matching mechanism to align data without complete pairing.
  • It integrates reconstruction loss with a dependence-driven matching term (using uKTA or SMI) to learn a common latent space for cross-modal correspondence.
  • DMAE employs alternating optimization and a relaxed permutation matrix approach, yielding competitive performance in image-text retrieval and unsupervised classifier learning.

Deep Matching Autoencoders (DMAE) provide a principled framework for learning a common latent representation and inferring the correspondence between data in multiple modalities (or “views”) when paired annotations are limited or absent. By combining modality-specific autoencoders with an explicit pairing inference mechanism, DMAE addresses the challenge of cross-modal representation learning in the fully supervised, semi-supervised, and fully unsupervised regimes. The method’s objective integrates autoencoding reconstruction with a dependence-driven matching term, enabling multi-modal embedding and object matching even when no initial correspondences are known (Mukherjee et al., 2017).

1. Problem Formulation and Motivation

DMAE is designed for scenarios involving two data views, denoted $X = \{x_i\}_{i=1}^n$ and $Y = \{y_j\}_{j=1}^n$, where $x_i \in \mathbb{R}^{d_x}$ and $y_j \in \mathbb{R}^{d_y}$. Traditional cross-modal embedding approaches, such as CCA, DeepCCA, and two-branch ranking nets, require a set of explicitly paired data $(x_i, y_i)$. However, many practical applications feature only partially paired or entirely unpaired datasets, rendering standard supervised or correlation-based losses inapplicable. DMAE simultaneously optimizes (i) representation learning via deep autoencoders for each modality and (ii) inference of the unknown one-to-one matching between $X$ and $Y$ by maximizing a statistical dependence criterion in a shared latent space.

2. Network Architectures and Latent Spaces

For each modality, DMAE employs separate deep autoencoders:

  • For view $X$, the encoder $E_x(x; \Theta_x)$ maps $x \in \mathbb{R}^{d_x}$ to a latent code $z_x \in \mathbb{R}^{d}$ via a composition of nonlinear layers, followed by a decoder $D_x(z_x; \Phi_x)$ expanding $z_x$ back to the original space.
  • For view $Y$, $E_y(y; \Theta_y)$ and $D_y(z_y; \Phi_y)$ are defined analogously, with an identical latent space dimensionality $d$ to facilitate cross-view dependence estimation.

Encoders and decoders are modality-specific and do not share weights, but their aligned latent representations enable statistical dependence to be quantified across modalities.
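The two-branch layout described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the architecture from the paper: the layer widths, the tanh nonlinearity, and the helper names `init_mlp` and `mlp` are all hypothetical. It shows only the forward pass, with separate parameters per modality and a shared latent dimensionality.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Return a list of (W, b) pairs for a fully connected network."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, k)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """Apply the layers: tanh on all but the last layer, which stays linear."""
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
d_x, d_y, d_latent = 20, 12, 4  # illustrative sizes (assumed, not from the paper)

# Separate, unshared parameters per modality; only d_latent is common.
enc_x = init_mlp([d_x, 16, d_latent], rng)
dec_x = init_mlp([d_latent, 16, d_x], rng)
enc_y = init_mlp([d_y, 16, d_latent], rng)
dec_y = init_mlp([d_latent, 16, d_y], rng)

X = rng.normal(size=(5, d_x))
Y = rng.normal(size=(5, d_y))
Zx, Zy = mlp(enc_x, X), mlp(enc_y, Y)          # latent codes, same width
X_hat, Y_hat = mlp(dec_x, Zx), mlp(dec_y, Zy)  # reconstructions
```

Because both encoders emit codes of the same width, kernel Gram matrices computed on `Zx` and `Zy` are directly comparable, which is what the dependence term requires.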

3. Joint Reconstruction and Matching Objective

DMAE seeks a joint solution to both representation learning and pairing inference. Let $\Pi \in \{0,1\}^{n \times n}$ be a permutation matrix representing the unknown correspondence, so that $\Pi_{ij} = 1$ if $x_i$ is paired with $y_j$. Writing $\Theta_x, \Theta_y$ for the encoder parameters and $\Phi_x, \Phi_y$ for the parameters of the decoders $D_x, D_y$, the full optimization objective is

$$\min_{\Theta_x, \Phi_x, \Theta_y, \Phi_y, \Pi} \; \mathcal{L}_{\mathrm{rec}} + \lambda \, \mathcal{L}_{\mathrm{match}}$$

with components:

  • Reconstruction Loss:

$$\mathcal{L}_{\mathrm{rec}} = \sum_{i=1}^{n} \left\| x_i - D_x(E_x(x_i)) \right\|_2^2 + \sum_{j=1}^{n} \left\| y_j - D_y(E_y(y_j)) \right\|_2^2$$

  • Matching/Dependence Term:

$$\mathcal{L}_{\mathrm{match}} = -\, \mathcal{D}\left( \left\{ \left( E_x(x_i),\, E_y(y_{\pi(i)}) \right) \right\}_{i=1}^{n} \right)$$

with $\pi$ defined by $\Pi$: $\pi(i) = j$ iff $\Pi_{ij} = 1$.

DMAE explores two forms for $\mathcal{D}$:

  • Unnormalized Kernel Target Alignment (uKTA): computes dependence as $\mathrm{tr}(K \Pi L \Pi^\top)$, with $K$ and $L$ Gram matrices from kernel functions on the latent codes of $X$ and $Y$, respectively.
  • Squared-loss Mutual Information (SMI): employs kernel density-ratio estimation, providing a flexible weighting via learned coefficients $\alpha_\ell$. SMI reduces to uKTA if all $\alpha_\ell$ are uniform.

The hyperparameter $\lambda$ modulates the trade-off between within-modality autoencoding and cross-modal matching.
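To make the objective concrete, the sketch below evaluates a reconstruction loss and a uKTA-style dependence term for a given matching matrix. It is an illustration under assumptions rather than the paper's implementation: the Gaussian kernel, its bandwidth `sigma`, and the function names are hypothetical, and the matching matrix (called `Pi` here) uses the convention that entry (i, j) is 1 when the i-th item of one view is paired with the j-th item of the other.

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    """Gram matrix of a Gaussian kernel on the rows of Z (kernel choice assumed)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def reconstruction_loss(X, X_hat, Y, Y_hat):
    """Sum of squared reconstruction errors over both views."""
    return np.sum((X - X_hat) ** 2) + np.sum((Y - Y_hat) ** 2)

def ukta(K, L, Pi):
    """Unnormalized kernel target alignment tr(K Pi L Pi^T)."""
    return np.trace(K @ Pi @ L @ Pi.T)

def dmae_objective(X, X_hat, Y, Y_hat, Zx, Zy, Pi, lam=1.0):
    """Reconstruction loss minus lam times the dependence of the matched codes."""
    K, L = gaussian_gram(Zx), gaussian_gram(Zy)
    return reconstruction_loss(X, X_hat, Y, Y_hat) - lam * ukta(K, L, Pi)

# Toy check: Zy is a shuffled copy of Zx, so the true matching is known.
rng = np.random.default_rng(0)
n = 6
Zx = rng.normal(size=(n, 3))
perm = rng.permutation(n)
Zy = np.empty_like(Zx)
Zy[perm] = Zx                       # Zy[perm[i]] equals Zx[i]
Pi_true = np.zeros((n, n))
Pi_true[np.arange(n), perm] = 1.0   # row i points at its partner perm[i]

K, L = gaussian_gram(Zx), gaussian_gram(Zy)
score_true = ukta(K, L, Pi_true)
score_identity = ukta(K, L, np.eye(n))
```

In this toy setting the true matching attains the largest uKTA value among permutations (by the Cauchy-Schwarz inequality, since the permuted `L` then coincides with `K`), so `score_true` is never smaller than `score_identity`.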

4. Permutation Matrix Estimation and Relaxation

The inference of the matching matrix $\Pi$ is a quadratic assignment problem, known to be NP-hard in the discrete case. DMAE relaxes $\Pi$ to allow continuous values in $[0, 1]$, enforcing (soft) row and column sum constraints to ensure near-permutation behavior:

$$\Pi \in [0,1]^{n \times n}, \qquad \Pi \mathbf{1}_n \approx \mathbf{1}_n, \qquad \Pi^\top \mathbf{1}_n \approx \mathbf{1}_n$$

This relaxation enables end-to-end differentiable optimization by gradient descent, with the final solution optionally projected back to a valid permutation. In semi-supervised and supervised regimes, entries of $\Pi$ for known pairs are fixed accordingly.
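One way to picture the soft constraints and the final projection is the sketch below. Both functions are assumptions made for illustration: the squared row/column-sum penalty is one plausible form of the soft constraint, and the greedy rounding is a simple stand-in for a proper assignment solver (it is not guaranteed to find the closest permutation). The source does not specify either choice.

```python
import numpy as np

def permutation_penalty(Pi):
    """Squared deviation of row and column sums from 1 (assumed penalty form)."""
    row = Pi.sum(axis=1) - 1.0
    col = Pi.sum(axis=0) - 1.0
    return np.sum(row ** 2) + np.sum(col ** 2)

def greedy_round(Pi):
    """Round a relaxed matrix to a permutation by repeatedly taking the
    largest remaining entry and blocking its row and column."""
    n = Pi.shape[0]
    P = np.zeros((n, n))
    work = np.array(Pi, dtype=float)
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(work), work.shape)
        P[i, j] = 1.0
        work[i, :] = -np.inf
        work[:, j] = -np.inf
    return P

# A relaxed matrix that is close to a known permutation.
P_true = np.eye(4)[[2, 0, 3, 1]]
Pi_soft = 0.8 * P_true + 0.2 / 4   # rows and columns still sum to 1
P_rounded = greedy_round(Pi_soft)
```

A matrix with all row and column sums equal to 1 has zero penalty, and rounding `Pi_soft` recovers `P_true` because its largest entries sit exactly on the true pairs.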

5. Optimization Strategy

DMAE employs an alternating optimization procedure:

A. Network Update: With the matching matrix $\Pi$ fixed, optimize the encoder and decoder parameters $\Theta_x$, $\Phi_x$, $\Theta_y$, $\Phi_y$ by backpropagation on the full objective.

B. Matching Update: With encoder/decoder parameters fixed, update $\Pi$ by gradient-based minimization of the matching term plus permutation regularization, under box constraints $\Pi_{ij} \in [0, 1]$.

This alternation proceeds until convergence, reliably bootstrapping from a random initial $\Pi$ to high-quality matchings, as evidenced by monotonically increasing mean precision/recall of the estimated $\Pi$ over iterations.
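Step B can be sketched as projected gradient descent on a relaxed matching matrix with the network held fixed. The version below is a hedged illustration, not the paper's procedure: it assumes the uKTA form of the dependence term with symmetric Gram matrices `K` and `L`, a squared row/column-sum penalty with weight `mu`, and illustrative values for `step` and `iters`. All names are hypothetical.

```python
import numpy as np

def matching_loss(Pi, K, L, mu):
    """Negative uKTA dependence plus a soft row/column-sum penalty."""
    row = Pi.sum(axis=1) - 1.0
    col = Pi.sum(axis=0) - 1.0
    return -np.trace(K @ Pi @ L @ Pi.T) + mu * (row @ row + col @ col)

def matching_grad(Pi, K, L, mu):
    """Gradient of matching_loss with respect to Pi.

    For symmetric K and L, d/dPi tr(K Pi L Pi^T) = 2 K Pi L.
    The penalty contributes 2 * mu * (row_i + col_j) to entry (i, j).
    """
    row = Pi.sum(axis=1) - 1.0
    col = Pi.sum(axis=0) - 1.0
    return -2.0 * K @ Pi @ L + 2.0 * mu * (row[:, None] + col[None, :])

def update_matching(Pi, K, L, mu=1.0, step=1e-2, iters=100):
    """Gradient steps on Pi, each followed by clipping to the box [0, 1]."""
    for _ in range(iters):
        Pi = np.clip(Pi - step * matching_grad(Pi, K, L, mu), 0.0, 1.0)
    return Pi

# Usage on toy Gram matrices built from random latent codes.
rng = np.random.default_rng(0)
n = 5
A, B = rng.normal(size=(n, 3)), rng.normal(size=(n, 3))
K, L = A @ A.T, B @ B.T             # symmetric linear-kernel Gram matrices
Pi0 = np.full((n, n), 1.0 / n)      # uninformative starting point
Pi = update_matching(Pi0, K, L)
```

In a full alternation, this update would be interleaved with ordinary backpropagation steps on the autoencoder parameters. The step size and penalty weight typically need tuning, since too large a step can make the iterates oscillate.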

6. Unification of Supervision Regimes

DMAE’s formulation is agnostic to the amount of cross-modal pairing available:

  • Supervised: All $n$ pairs known; the matching matrix $\Pi$ is fixed.
  • Semi-supervised: Only $m < n$ pairs known; loss terms for those indices are imposed directly, while unknown segments of $\Pi$ are optimized.
  • Unsupervised: No pairs known ($m = 0$). The entire matching structure is inferred.

This enables DMAE to gracefully interpolate between unsupervised object matching (e.g., Unsupervised Classifier Learning/UCL) and conventional cross-modal embedding.
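The three regimes can be handled by one piece of bookkeeping: fix the entries of the matching matrix that are known and let the optimizer touch only the rest. The sketch below is an assumption about how that might be organized, not the paper's code. The helper names, the uniform initialization of free entries, and the requirement that known pairs use distinct rows and columns are all hypothetical.

```python
import numpy as np

def init_matching(n, known_pairs):
    """Build an initial relaxed matching and a mask of optimizable entries.

    known_pairs is a list of (i, j) index pairs, assumed to use distinct
    rows and distinct columns. Their rows and columns are frozen.
    """
    m = len(known_pairs)
    Pi = np.full((n, n), 1.0 / max(n - m, 1))
    free = np.ones((n, n), dtype=bool)
    for i, j in known_pairs:
        Pi[i, :] = 0.0
        Pi[:, j] = 0.0
        Pi[i, j] = 1.0
        free[i, :] = False
        free[:, j] = False
    return Pi, free

def masked_step(Pi, grad, free, step):
    """Gradient step on the free entries only, then clip them to [0, 1]."""
    out = Pi.copy()
    out[free] = np.clip(Pi[free] - step * grad[free], 0.0, 1.0)
    return out

n = 4
Pi_unsup, free_unsup = init_matching(n, [])                             # no pairs known
Pi_semi, free_semi = init_matching(n, [(0, 2)])                         # one pair known
Pi_sup, free_sup = init_matching(n, [(0, 2), (1, 0), (2, 3), (3, 1)])   # all pairs known
```

With every pair known the mask is empty and the matrix is already a permutation. With none known every entry is free. The semi-supervised case sits in between, with the remaining block initialized so that each row and column still sums to 1.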

7. Empirical Results and Ablative Analyses

Experimental evaluation demonstrates the effectiveness of DMAE across retrieval and unsupervised classifier learning tasks:

  • Image-Sentence Retrieval (Flickr30K, MS-COCO):
    • Fully supervised: DMAE-SMI matches or slightly exceeds DeepCCA and two-branch net baselines, e.g., on MS-COCO image→text retrieval.
    • Semi-supervised (e.g., 40% paired + 60% unpaired): Outperforms Matching CCA and deep-MCCA by 5–10 points.
    • Fully unsupervised: Achieves nontrivial, above-chance retrieval (e.g., on Flickr30K).
  • Unsupervised Classifier Learning (AwA, CIFAR-10):
    • No image-label pairs are given; only a bag of class word-vectors is available. On AwA, the matching inferred by DMAE-SMI is assessed by average per-class precision and recall.
    • SVMs trained on inferred label mappings yield accuracy substantially above chance on both AwA and CIFAR-10.
    • Semi-supervised transductive variants, which combine labeled, unlabeled, and test images, further boost accuracy on CIFAR-10.

Ablations establish the necessity of the reconstruction term (“Deep-SMI” without reconstruction is markedly less effective) and the consistent superiority of SMI over uKTA, attributed to SMI’s adaptive density-ratio weighting.

8. Context and Significance

DMAE introduces a mechanism for aligning multi-modal datasets when explicit pairings are missing, an area where classical deep metric learning and traditional CCA-based models are inapplicable. The method’s applicability extends to multi-view retrieval, unsupervised classifier induction, and generalizable cross-modal matching tasks, providing tractable optimization through relaxation and alternation strategies. The empirical evidence supports the framework’s capacity to discover meaningful correspondences and achieve competitive performance across varying regimes of supervision (Mukherjee et al., 2017).
