Contractive Denoising Autoencoders
- Contractive Denoising Autoencoders (CDAE) are neural models that integrate a denoising criterion with a contractive penalty to achieve robust and invariant feature representations.
- The model employs a symmetric encoder–decoder architecture optimized via reconstruction loss and the Frobenius norm of the encoder’s Jacobian to mitigate noise and local perturbations.
- Empirical results on MNIST demonstrate that CDAEs modestly improve classification accuracy compared to traditional autoencoders by effectively combining denoising and contraction.
Contractive Denoising Autoencoders (CDAEs) are a variant of autoencoder neural networks that combine the denoising criterion of Denoising Autoencoders (DAEs) and the local invariance of Contractive Autoencoders (CAEs). A CDAE is designed to be robust to both large input corruptions and infinitesimal perturbations, yielding feature representations that are simultaneously insensitive to structured noise and small input variations. In its canonical form, the model is realized as a symmetric encoder–decoder architecture and trained by optimizing a sum of denoising reconstruction loss and the Frobenius norm of the encoder’s Jacobian, subject to a stochastic corruption process on the input (Chen et al., 2013).
1. Architecture and Model Specification
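A single-layer CDAE consists of an encoder–decoder pair operating on an input $x \in \mathbb{R}^{d}$. The process begins by sampling a corrupted version $\tilde{x} \sim q(\tilde{x} \mid x)$. The encoder function $f$ maps $\tilde{x}$ into a hidden representation $h = f(\tilde{x}) \in \mathbb{R}^{d_h}$, and the decoder $g$ reconstructs the input as $\hat{x} = g(h)$.
- Encoder: $h = f(\tilde{x}) = s(W\tilde{x} + b)$, with weight matrix $W \in \mathbb{R}^{d_h \times d}$, bias $b \in \mathbb{R}^{d_h}$, and nonlinearity $s$.
- Decoder: $\hat{x} = g(h) = s(W^{\top} h + c)$, with bias $c \in \mathbb{R}^{d}$ and tied weights $W' = W^{\top}$ for architectural symmetry.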
The typical training loop per minibatch proceeds as:
- Corrupt: $\tilde{x} \sim q(\tilde{x} \mid x)$
- Encode: $h = f(\tilde{x})$
- Decode: $\hat{x} = g(h)$
- Compute loss and gradients
- Update parameters via stochastic gradient descent (SGD) or variants
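The first three steps of this loop can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the logistic sigmoid activation, the random masking scheme, and the names `cdae_forward` and `mask_fraction` are assumptions introduced here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cdae_forward(x, W, b, c, rng, mask_fraction=0.0125):
    """Forward pass for a minibatch x of shape (n, d) with tied weights.

    Assumption: sigmoid activations and random masking noise.
    """
    # Corrupt: zero each input unit independently with probability mask_fraction.
    keep = rng.random(x.shape) >= mask_fraction
    x_tilde = x * keep
    # Encode: h = s(W x_tilde + b), where W has shape (d_h, d).
    h = sigmoid(x_tilde @ W.T + b)
    # Decode with tied weights: x_hat = s(W^T h + c).
    x_hat = sigmoid(h @ W + c)
    return x_tilde, h, x_hat

rng = np.random.default_rng(0)
d, d_h, n = 784, 200, 4
x = rng.random((n, d))                      # synthetic stand-in for image data
W = rng.normal(0.0, 0.01, size=(d_h, d))
b = np.zeros(d_h)
c = np.zeros(d)

x_tilde, h, x_hat = cdae_forward(x, W, b, c, rng)
print(h.shape, x_hat.shape)
```

The loss and gradient steps are omitted here; they depend on the reconstruction loss and activation, which this sketch only assumes.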
2. Objective Function and Regularization
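The CDAE objective function is the sum of a denoising reconstruction term and a contractive penalty, imposed for each data point $x$ in the training set $\mathcal{D}$:

$$\mathcal{J}_{\text{CDAE}}(\theta) = \sum_{x \in \mathcal{D}} \mathbb{E}_{\tilde{x} \sim q(\tilde{x} \mid x)} \Big[ L\big(x, g(f(\tilde{x}))\big) + \lambda \, \|J_f(\tilde{x})\|_F^2 \Big], \qquad \theta = \{W, b, c\}$$

- Denoising Loss: Enforces robustness to large corruptions by reconstructing the original input from its noisy version: $L\big(x, g(f(\tilde{x}))\big)$, where $L$ is a reconstruction error between the clean input $x$ and the reconstruction $\hat{x} = g(f(\tilde{x}))$.
- Contractive Penalty: The squared Frobenius norm of the encoder’s Jacobian penalizes local sensitivity: $\|J_f(\tilde{x})\|_F^2 = \sum_{j}\sum_{i} \left( \partial h_j / \partial \tilde{x}_i \right)^2$. For elementwise activations, this is $\|J_f(\tilde{x})\|_F^2 = \sum_{j} \big(s'(a_j)\big)^2 \sum_{i} W_{ji}^2$, where $a_j = \sum_i W_{ji}\tilde{x}_i + b_j$ and $s'$ depends on the activation.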
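For the activation used in the experiments, $s'(a_j)$ has a simple closed form in terms of $h_j$; for example, $s'(a_j) = h_j(1 - h_j)$ for the logistic sigmoid and $s'(a_j) = 1 - h_j^2$ for $\tanh$.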
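As a sanity check on the closed form of the penalty, the sketch below compares it with a finite-difference Jacobian. It assumes a logistic sigmoid encoder and a squared-error reconstruction loss; both are illustrative assumptions, and the helper names (`contractive_penalty`, `numerical_jacobian`, `cdae_loss`) are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    """Closed-form squared Frobenius norm of dh/dx for a sigmoid encoder."""
    h = encode(x, W, b)
    # dh_j/dx_i = h_j (1 - h_j) W_ji, so sum the squares row by row.
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_jacobian(x, W, b, eps=1e-6):
    """Central finite differences, used only to verify the closed form."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (encode(x + step, W, b) - encode(x - step, W, b)) / (2 * eps)
    return J

def cdae_loss(x, x_tilde, W, b, c, lam=0.1):
    """Per-sample objective: squared error (assumed) plus weighted penalty."""
    h = encode(x_tilde, W, b)
    x_hat = sigmoid(W.T @ h + c)
    return 0.5 * np.sum((x_hat - x) ** 2) + lam * contractive_penalty(x_tilde, W, b)

rng = np.random.default_rng(0)
x = rng.random(6)
W = rng.normal(size=(3, 6))
b = rng.normal(size=3)
c = np.zeros(6)

closed = contractive_penalty(x, W, b)
numeric = np.sum(numerical_jacobian(x, W, b) ** 2)
print(np.isclose(closed, numeric, rtol=1e-5))
```

The closed form avoids building the full Jacobian, which matters when the input and hidden layers are large.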
3. Stochastic Corruption and Noise Models
CDAEs employ a stochastic corruption process to push the learned mapping toward the data manifold's structure.
- Masking noise: A fraction $\nu$ of input units is set to zero. In experiments, approximately every 80th pixel in the 784-dimensional MNIST inputs is masked (effectively $\nu \approx 1/80 = 1.25\%$).
- Gaussian noise: Additive perturbation $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
The principal mechanism is to drive the model to reconstruct uncorrupted inputs from their noisy versions, fostering robustness to input-level noise.
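Both noise models are short to write down. The following sketch assumes that masking is applied independently per input unit with probability `fraction`; the source only states that roughly every 80th pixel is masked, so the exact sampling scheme is an assumption, as are the function names.

```python
import numpy as np

def masking_noise(x, fraction, rng):
    """Zero each entry independently with probability `fraction`."""
    keep = rng.random(x.shape) >= fraction
    return x * keep

def gaussian_noise(x, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation `sigma`."""
    return x + rng.normal(0.0, sigma, size=x.shape)

rng = np.random.default_rng(0)
x = np.ones((1000, 784))                    # synthetic stand-in batch

x_masked = masking_noise(x, 1.0 / 80.0, rng)
x_gauss = gaussian_noise(x, 0.1, rng)

# Empirical fraction of masked units; it should be close to 1/80.
masked_fraction = float(np.mean(x_masked == 0))
print(round(masked_fraction, 3))
```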
4. Stacking, Training Methodology, and Optimization
CDAEs admit stacking to form deep architectures. The standard pretraining sequence is as follows:
- Train the first CDAE layer on $x$ to obtain $h^{(1)}$.
- Use $h^{(1)}$ as input, apply corruption, and train a second CDAE to yield $h^{(2)}$.
- Repeat stacking as desired.
Layer-wise unsupervised pretraining is performed independently for each module, which minimizes the CDAE objective for its own layer. After this stage, the network can act as a fixed feature extractor. In the primary experimental protocol, the learned codes from the middle layer are input to an SVM with RBF kernel; no supervised backpropagation-based fine-tuning is reported.
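A minimal sketch of greedy layer-wise pretraining is given below. It is written under several assumptions that the source does not confirm: sigmoid activations, squared-error reconstruction, random masking noise, plain minibatch SGD with hand-derived gradients, and the hypothetical helpers `train_cdae_layer` and `pretrain_stack`. Random data stands in for MNIST.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_cdae_layer(X, n_hidden, rng, lam=0.1, mask_fraction=0.0125,
                     lr=0.1, epochs=20, batch_size=100):
    """Train one tied-weight CDAE layer with minibatch SGD (illustrative)."""
    n, d = X.shape
    limit = np.sqrt(6.0 / (d + n_hidden))           # Xavier uniform bound
    W = rng.uniform(-limit, limit, size=(n_hidden, d))
    b, c = np.zeros(n_hidden), np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            x = X[perm[start:start + batch_size]]
            m = x.shape[0]
            x_tilde = x * (rng.random(x.shape) >= mask_fraction)
            h = sigmoid(x_tilde @ W.T + b)           # (m, n_hidden)
            x_hat = sigmoid(h @ W + c)               # (m, d)
            dh_da = h * (1.0 - h)
            row_norms = np.sum(W ** 2, axis=1)       # (n_hidden,)
            # Backpropagate 0.5 * ||x_hat - x||^2 through the decoder.
            delta_out = (x_hat - x) * x_hat * (1.0 - x_hat)
            grad_h = delta_out @ W.T
            # Add the gradient of lam * sum_j (h_j(1-h_j))^2 ||W_j||^2 w.r.t. h.
            grad_h += lam * 2.0 * dh_da * (1.0 - 2.0 * h) * row_norms
            delta_hid = grad_h * dh_da
            # W receives decoder, encoder, and direct penalty contributions.
            grad_W = (h.T @ delta_out + delta_hid.T @ x_tilde
                      + lam * 2.0 * np.sum(dh_da ** 2, axis=0)[:, None] * W)
            W -= lr * grad_W / m
            b -= lr * delta_hid.mean(axis=0)
            c -= lr * delta_out.mean(axis=0)
    return W, b, c

def pretrain_stack(X, layer_sizes, rng, **kwargs):
    """Train layers one at a time; each layer sees the previous layer's codes."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b, c = train_cdae_layer(H, n_hidden, rng, **kwargs)
        params.append((W, b, c))
        H = sigmoid(H @ W.T + b)    # clean codes; corruption is applied in training
    return params, H

rng = np.random.default_rng(0)
X = rng.random((200, 784))          # synthetic stand-in for MNIST
params, codes = pretrain_stack(X, [200, 100], rng, epochs=2)
print(codes.shape)
```

The returned `codes` play the role of the middle-layer features that would be handed to a separate classifier.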
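Parameter initialization follows the “Xavier” (Glorot) uniform scheme: $W \sim \mathcal{U}\!\left[-\sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the input and output dimensions of the layer. Biases are initialized to zero. Optimization is conducted using SGD or variants such as momentum and Adam.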
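For illustration, Xavier uniform initialization draws each weight from a uniform distribution bounded by $\sqrt{6/(n_{\text{in}} + n_{\text{out}})}$. The sketch below applies it to a 784-to-200 layer; the function name `xavier_uniform` and the `(n_out, n_in)` weight layout are choices made here, not details from the source.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_out, n_in) weight matrix from the Xavier uniform range."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
n_in, n_out = 784, 200
W = xavier_uniform(n_in, n_out, rng)
b = np.zeros(n_out)     # encoder bias starts at zero
c = np.zeros(n_in)      # decoder bias starts at zero

limit = np.sqrt(6.0 / (n_in + n_out))
print(W.shape, bool(np.abs(W).max() <= limit))
```

A uniform distribution on $[-l, l]$ has variance $l^2/3$, so the weights here have variance $2/(n_{\text{in}} + n_{\text{out}})$.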
5. Hyperparameters and Implementation Specifics
Empirical studies were conducted on MNIST using two symmetric autoencoder architectures:
- 784–200–100–200–784
- 784–200–50–200–784
Key hyperparameter settings:
- Hidden units $d_h$: 100 or 50 (bottleneck layer).
- Contractive penalty weight $\lambda$: 0.1.
- Corruption level: every 80th pixel masked (about 1.25% of inputs).
- Activation: the same elementwise nonlinearity for both encoder and decoder layers.
- Weight initialization: Xavier uniform.
- Bias initialization: zeros.
- Batch size, learning rate, and number of epochs are not specified; typical values in related literature are batch size 100, learning rate 0.01–0.1, epochs 50–200.
6. Experimental Evaluation and Comparative Results
The CDAE was evaluated on a subset of 18,000 MNIST digit images (1,800 per class), split equally into training and test sets (9,000 images each). After pretraining two stacked CDAE layers, the middle code (size 100 or 50) was used as input to a radial basis function (RBF) kernel SVM for classification.
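The subset construction can be sketched as follows. The source does not say whether the train/test split was balanced within each class, so the per-class half-and-half split here is an assumption, as is the helper name `balanced_split`; synthetic labels stand in for the MNIST labels.

```python
import numpy as np

def balanced_split(labels, per_class, rng):
    """Pick `per_class` examples per label and split each class in half."""
    train_idx, test_idx = [], []
    half = per_class // 2
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))[:per_class]
        train_idx.extend(idx[:half])
        test_idx.extend(idx[half:])
    return np.array(train_idx), np.array(test_idx)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 2000)     # synthetic stand-in labels
train_idx, test_idx = balanced_split(labels, per_class=1800, rng=rng)
print(len(train_idx), len(test_idx))        # 9000 9000
```

The codes computed for these indices could then be passed to an RBF-kernel SVM, for example scikit-learn's `SVC(kernel="rbf")`.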
Observed test accuracy:
| Architecture | AE | DAE | CAE | CDAE |
|---|---|---|---|---|
| 784–200–100–200–784 | 92.42% | 92.51% | 93.11% | 93.31% |
| 784–200–50–200–784 | 93.12% | 93.28% | 93.31% | 93.77% |
CDAE outperforms AE, DAE, and CAE on these MNIST subsets, with margins between 0.20 and 0.89 percentage points. No ablation studies varying $\lambda$ or noise level are provided.
7. Theoretical Properties, Empirical Insights, and Limitations
CDAE’s dual regularization yields features that are robust both to large stochastic corruptions (denoising) and to infinitesimal input changes (contraction). The contractive penalty specifically encourages encoder invariance to local perturbations, producing smoother low-dimensional embeddings. Empirically, combining denoising and contractive penalties delivers a consistent, though modest, improvement in classification accuracy compared to either regularizer alone.
Identified limitations:
- All reported results are restricted to MNIST; extension to other domains remains unverified.
- The contractive penalty introduces additional computation per sample, with increasing cost for large hidden or input layers.
- No experiments addressing supervised end-to-end fine-tuning are reported; integration of CDAE pretraining into a full supervised pipeline is untested.
- Systematic study of hyperparameter impact (penalty weight $\lambda$, fraction of corrupted inputs, number of layers) is not conducted.
A plausible implication is that the approach is straightforward to implement and scale, but broader empirical validation and efficiency improvements remain open research directions (Chen et al., 2013).