Doubly Stochastic Adversarial Autoencoder
- The paper introduces DS-AAE, a novel generative autoencoder that replaces deterministic discriminators with a stochastic function space to mitigate adversarial overfitting.
- It employs random feature mappings to smooth gradients and enforce a robust matching between the aggregated posterior over latent codes and the prior distribution.
- Empirical results on binary MNIST indicate improved mode coverage and sample diversity compared to standard AAEs, despite modest Parzen-window likelihood scores.
The Doubly Stochastic Adversarial Autoencoder (DS-AAE) is a generative autoencoder architecture that replaces the deterministic adversary of Adversarial Autoencoders (AAEs) with a space of stochastic functions parameterized via random feature mappings. This innovation, leveraging kernel–random process duality, introduces a controlled source of auxiliary randomness to the adversarial regularization. DS-AAE targets key limitations in traditional adversarial autoencoders, particularly overfitting of the adversary and inadequate mode coverage, by promoting exploration and sample diversity while preserving computational efficiency (Azarafrooz, 2018).
1. Architecture and Key Components
DS-AAE consists of three principal modules:
- Encoder: A deterministic feed-forward network mapping data samples $x$ to latent codes $z = q(x)$.
- Decoder (Generator): A mirrored feed-forward network reconstructing the data $\hat{x}$ from the latent code $z$.
- Stochastic Adversary: In contrast to the standard parametrized discriminator of AAEs, the adversarial function is chosen from a continuum of stochastic functions, constructed through random feature maps $\phi_\omega$ parameterized by auxiliary randomness $\omega$ and linear weights $\theta$.
Graphically, data, prior samples, and aggregated posterior codes flow through distinct branches. The adversary judges codes originating from the imposed prior $p(z)$ and the aggregated posterior $q(z)$, maximizing their discrepancy, while the encoder and decoder networks are updated to minimize it.
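The three modules can be sketched in a few lines of NumPy. This is a minimal illustration of the data flow only: the layer widths are smaller than the paper's configuration, and all names (`encoder`, `decoder`, `stochastic_adversary`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def encoder(x, weights):
    # Deterministic feed-forward map from data x to latent code z.
    h = x
    for W in weights[:-1]:
        h = relu(h @ W)
    return h @ weights[-1]                              # latent code z

def decoder(z, weights):
    # Mirrored feed-forward map reconstructing x from z.
    h = z
    for W in weights[:-1]:
        h = relu(h @ W)
    return 1.0 / (1.0 + np.exp(-(h @ weights[-1])))     # sigmoid output

def stochastic_adversary(z, omegas, biases, theta):
    # Adversary as a linear form theta^T phi_omega(z) over random
    # features, rather than a fixed parametric discriminator.
    phi = np.sqrt(2.0) * np.cos(z @ omegas + biases)    # random feature map
    return phi @ theta

# Toy shapes: 784-dim binarized-MNIST-like input, 6-dim latent code.
enc_w = [rng.normal(0, 0.01, s) for s in [(784, 256), (256, 6)]]
dec_w = [rng.normal(0, 0.01, s) for s in [(6, 256), (256, 784)]]
omegas = rng.normal(size=(6, 500))                      # ~500 random features
biases = rng.uniform(0, 2 * np.pi, 500)
theta = rng.normal(size=500)

x = rng.integers(0, 2, size=(8, 784)).astype(float)
z = encoder(x, enc_w)
x_hat = decoder(z, dec_w)
score = stochastic_adversary(z, omegas, biases, theta)
```

Note that the adversary draws fresh `omegas` rather than learning fixed discriminator weights; only the linear head `theta` is trained.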
2. Mathematical Formulation
The DS-AAE objective combines a reconstruction criterion with a doubly stochastic adversarial penalty:
- Reconstruction Loss: For data $x$ with reconstruction $\hat{x}$ obtained by decoding the code $z = q(x)$, the loss is
$$\mathcal{L}_{\text{rec}} = -\sum_{i} \left[ x_i \log \hat{x}_i + (1 - x_i) \log (1 - \hat{x}_i) \right]$$
for cross-entropy, or alternatively the mean-squared error $\|x - \hat{x}\|^2$.
- Adversarial Regularizer: To impose the prior $p(z)$ on the latent space, the discrepancy between $p(z)$ and the aggregated posterior $q(z)$ is minimized. Standard approaches use an explicit kernel $k$ in the Maximum Mean Discrepancy (MMD):
$$\mathrm{MMD}^2(p, q) = \mathbb{E}_{z, z' \sim p}\!\left[k(z, z')\right] - 2\,\mathbb{E}_{z \sim p,\ \tilde{z} \sim q}\!\left[k(z, \tilde{z})\right] + \mathbb{E}_{\tilde{z}, \tilde{z}' \sim q}\!\left[k(\tilde{z}, \tilde{z}')\right]$$
DS-AAE improves upon this by defining a doubly stochastic gradient via random features $\phi_\omega$ satisfying $k(z, z') = \mathbb{E}_{\omega}\!\left[\phi_\omega(z)\,\phi_\omega(z')\right]$. Any admissible adversary $f$ is approximated by the linear form $f(z) = \theta^{\top}\phi_\omega(z)$, and the regularizer becomes
$$\mathcal{L}_{\text{adv}} = \max_{\theta}\ \mathbb{E}_{z \sim p(z)}\!\left[\theta^{\top}\phi_\omega(z)\right] - \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\theta^{\top}\phi_\omega(q(x))\right]$$
The overall optimization is
$$\min_{q,\,g}\ \mathcal{L}_{\text{rec}} + \mathcal{L}_{\text{adv}}$$
where $g$ denotes the decoder.
The stochastic feature mapping smooths gradients and prevents the adversary from tightly overfitting to the generator.
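Both quantities can be made concrete in a short NumPy sketch: the explicit-kernel MMD, and its random-feature counterpart in which the adversary is the linear form over features. The Gaussian stand-ins for the prior and the aggregated posterior, and all function names, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd2(zp, zq, sigma=1.0):
    # Biased empirical MMD^2 with an explicit RBF kernel:
    # E[k(z,z')] - 2 E[k(z, z~)] + E[k(z~, z~')].
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(zp, zp).mean() - 2 * k(zp, zq).mean() + k(zq, zq).mean()

def random_features(z, omegas, biases):
    # Random Fourier features phi_omega with E[phi(z) phi(z')] ~ k(z, z').
    return np.sqrt(2.0 / omegas.shape[1]) * np.cos(z @ omegas + biases)

def adversarial_discrepancy(zp, zq, theta, omegas, biases):
    # Linear-form adversary f(z) = theta^T phi_omega(z); the regularizer
    # is the gap between its mean score on prior samples and on codes.
    return (random_features(zp, omegas, biases) @ theta).mean() \
         - (random_features(zq, omegas, biases) @ theta).mean()

d = 6
z_prior = rng.normal(size=(200, d))               # samples from p(z)
z_post = rng.normal(2.0, 1.0, size=(200, d))      # stand-in for q(z)

omegas = rng.normal(size=(d, 500))                # fresh draw per minibatch
biases = rng.uniform(0, 2 * np.pi, 500)
# The best linear adversary points along the mean feature difference.
theta = random_features(z_prior, omegas, biases).mean(axis=0) \
      - random_features(z_post, omegas, biases).mean(axis=0)

gap = adversarial_discrepancy(z_prior, z_post, theta, omegas, biases)
m2 = mmd2(z_prior, z_post)
```

Because the adversary is linear in $\theta$, maximizing the discrepancy has a closed form in this sketch; in training, $\theta$ is instead updated by gradient ascent while $\omega$ is resampled.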
3. Optimization and Training Procedure
The DS-AAE is trained by minibatch-based alternating updates:
- Sample Data and Prior: Draw a minibatch of data points and prior samples.
- Latent Encoding: Map data through the encoder to obtain latent codes.
- Random Feature Sampling: Draw random feature parameters and compute corresponding random features.
- Build Doubly Stochastic Features: For both data and prior, compute doubly stochastic features via aggregation of random feature contributions.
- Adversary Step: Update adversary parameters via gradient ascent to maximize discrepancy between prior and aggregated posterior features.
- Generator/Encoder Step: Jointly update to minimize the reconstruction loss and reduce adversarially measured discrepancy.
All modules are trained with the Adam optimizer, with dropout applied only to the encoder's input. Empirical findings underscore batch-size sensitivity and the need for small learning rates when adversarial functions temporarily drift outside the RKHS, ensuring convergence.
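The steps above can be condensed into one alternating round, here with a fixed stand-in for the encoder output so only the adversary step is live. Seeds, shapes, and the learning rate are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(z, omegas, biases):
    # Random Fourier feature map shared by prior samples and codes.
    return np.sqrt(2.0 / omegas.shape[1]) * np.cos(z @ omegas + biases)

d, n_feat, lr = 6, 500, 1e-2                   # small learning rate, per the text
omegas = rng.normal(size=(d, n_feat))          # step 3: fresh random features
biases = rng.uniform(0, 2 * np.pi, n_feat)
theta = np.zeros(n_feat)

z_prior = rng.normal(size=(128, d))                # step 1: prior samples
z_post = rng.normal(1.0, 1.0, size=(128, d))      # step 2: encoder codes (stand-in)

# Step 4: doubly stochastic features, averaged over the minibatch.
phi_prior = phi(z_prior, omegas, biases).mean(axis=0)
phi_post = phi(z_post, omegas, biases).mean(axis=0)

# Step 5: adversary ascent. f is linear in theta, so the gradient of the
# discrepancy theta^T (phi_prior - phi_post) is just the feature difference.
for _ in range(100):
    theta += lr * (phi_prior - phi_post)

# Step 6 would update the encoder/decoder to shrink this gap.
gap = theta @ (phi_prior - phi_post)
```

In the full procedure, step 6 backpropagates both the reconstruction loss and this adversarially measured gap through the encoder.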
4. Empirical Performance and Comparisons
Experiments were conducted on binary-thresholded MNIST using the following architecture (for DS-AAE): an encoder with three fully connected layers (1024→512→256→6), and a decoder symmetric in structure (256→512→1024→784), with ReLU activations except the final sigmoid layer.
The imposed prior is a six-dimensional isotropic Gaussian. Random features were instantiated for an RBF kernel, drawing approximately 500 features per batch.
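For the RBF kernel, the random features are the classic random Fourier features, with frequencies drawn from a Gaussian whose scale is set by the kernel bandwidth. A quick sketch (bandwidth and seed are illustrative) checks that the feature inner product tracks the exact kernel value:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_feat, sigma = 6, 500, 1.0

# Random Fourier features for the RBF kernel: omega ~ N(0, I / sigma^2),
# bias ~ Uniform(0, 2*pi), phi(z) = sqrt(2/n) * cos(omega^T z + b).
omegas = rng.normal(0, 1.0 / sigma, size=(d, n_feat))
biases = rng.uniform(0, 2 * np.pi, n_feat)

def phi(z):
    return np.sqrt(2.0 / n_feat) * np.cos(z @ omegas + biases)

z1, z2 = rng.normal(size=(2, d))
approx = (phi(z1[None]) @ phi(z2[None]).T).item()   # phi(z1)^T phi(z2)
exact = np.exp(-((z1 - z2) ** 2).sum() / (2 * sigma ** 2))
```

With ~500 features the Monte Carlo error of the approximation is on the order of a few percent, which is what lets a per-batch draw stand in for the exact kernel.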
Performance was evaluated via Parzen-window test log-likelihood (on 10K samples), compared to GAN, GMMN+AE, AAE, and MMD-AE:
| Model | Parzen LL (± std) |
|---|---|
| GAN | |
| GMMN+AE | |
| AAE | |
| MMD-AE | |
| DS-AAE | |
Qualitatively, DS-AAE samples demonstrated greater diversity in handwriting styles compared to AAE and MMD-AE and maintained sharp, hole-free latent traversals across classes, with increased multimodality.
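The Parzen-window metric used above fits a kernel density estimate on generated samples and scores held-out data under it. A minimal NumPy sketch, with an illustrative bandwidth and Gaussian stand-ins for the generated and test sets:

```python
import numpy as np

def parzen_log_likelihood(samples, test, sigma):
    # Gaussian Parzen window: for each test point x, average isotropic
    # Gaussian kernels of width sigma centered on the generated samples,
    # and return log densities.
    d = samples.shape[1]
    diffs = test[:, None, :] - samples[None, :, :]          # (n_test, n_gen, d)
    exps = -((diffs ** 2).sum(-1)) / (2 * sigma ** 2)
    # Numerically stable log-mean-exp over the generated samples.
    m = exps.max(axis=1, keepdims=True)
    lme = m[:, 0] + np.log(np.exp(exps - m).mean(axis=1))
    return lme - 0.5 * d * np.log(2 * np.pi * sigma ** 2)   # Gaussian normalizer

rng = np.random.default_rng(3)
gen = rng.normal(size=(1000, 6))       # stand-in for generated samples
test = rng.normal(size=(50, 6))        # stand-in for held-out data
ll = parzen_log_likelihood(gen, test, sigma=0.5).mean()
```

In the paper's setting the density is fit on 10K generated images and evaluated on the MNIST test set; the bandwidth is typically selected on a validation split.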
5. Advantages, Limitations, and Distinctive Features
Advantages over AAE and VAE:
- Enhanced mode exploration due to the auxiliary randomness $\omega$, reducing adversarial overfitting and generator collapse.
- Smoother adversarial gradients contribute to improved training stability.
- Increased capacity to match multimodal priors owing to the continuum of admissible stochastic adversaries.
Limitations:
- Elevated sensitivity to batch size; insufficiently large batches degrade gradient approximation quality.
- Temporary excursions of adversarial functions outside the RKHS require conservative learning rates, slowing training.
- Parzen-window test likelihoods indicate DS-AAE underperforms AAE quantitatively, though diversity is improved.
6. Extensions and Applications
Potential future directions include:
- Applying alternate positive-definite kernels (polynomial, Laplacian) by adapting corresponding random-feature sketches.
- Incorporating convolutional architectures to scale to natural image datasets (e.g., CIFAR, CelebA).
- Developing semi-supervised or conditional variants by conditioning the prior on auxiliary information such as labels.
- Stacking DS-AAE modules hierarchically for enriched latent structure.
- Exploring applications in domain adaptation, anomaly detection, and across modalities (including text and time series).
DS-AAE represents a refinement of autoencoding adversarial frameworks, leveraging stochastic function spaces for the adversary to foster diversity and mitigate mode collapse, at the cost of increased sensitivity to batch size and potentially slower convergence dynamics (Azarafrooz, 2018).