
Maxout Networks (1302.4389v4)

Published 18 Feb 2013 in stat.ML and cs.LG

Abstract: We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.

Citations (2,146)

Summary

  • The paper introduces maxout networks, built around a novel activation function designed to synergize with dropout, and shows significant improvements in classification performance on benchmark datasets.
  • The paper proves that maxout networks are universal approximators: each maxout unit is a learned convex piecewise linear function, and two such units suffice to approximate arbitrary continuous functions.
  • Empirical evaluations demonstrate notable reductions in test errors, achieving 0.45% on MNIST, 9.38% on CIFAR-10, and record-setting results on CIFAR-100 and SVHN.

Maxout Networks

The paper "Maxout Networks" by Goodfellow et al. introduces a novel neural network model characterized by the maxout activation function. This model is shown to enhance the performance of dropout, an established regularization technique, and significantly advances the state-of-the-art in classification tasks on several benchmark datasets.

Introduction

Dropout is a stochastic regularization method that trains a large ensemble of sub-networks by randomly dropping units during training; at test time it approximates averaging over this ensemble, and it has proven effective at preventing overfitting. However, conventional architectures are not designed with dropout's approximate model averaging in mind. The maxout network proposed in this paper is a direct effort to synergize with dropout, both facilitating optimization under dropout and making its approximate model averaging more accurate.
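As a point of reference for the rest of the summary, here is a minimal sketch of this scheme, assuming the standard formulation with drop probability p = 0.5 (the function name and the activation-scaling form of test-time inference are illustrative choices, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Dropout on a vector of hidden activations h.

    Training: each unit is zeroed independently with probability p, so every
    minibatch effectively trains a different sub-network.
    Test: nothing is dropped; activations are scaled by (1 - p), which is
    equivalent to the usual rule of halving the outgoing weights when p = 0.5
    and is the fast approximate model average that maxout aims to make accurate.
    """
    if train:
        mask = rng.random(h.shape) >= p   # keep each unit with probability 1 - p
        return h * mask
    return h * (1.0 - p)
```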

Maxout Activation Function

The maxout unit is a novel activation function that computes the maximum of a set of affine transformations of the input. For a given input $x \in \mathbb{R}^d$, a maxout hidden layer computes $h_i(x) = \max_{j \in [1,k]} z_{ij}$, where $z_{ij} = x^T W_{\cdot ij} + b_{ij}$, with learned parameters $W \in \mathbb{R}^{d \times m \times k}$ and $b \in \mathbb{R}^{m \times k}$. A single maxout unit is thus a piecewise linear approximation to an arbitrary convex function, allowing the network to learn its own activation functions.
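A minimal NumPy sketch of the layer defined above (the shapes follow the definition; the function name and toy dimensions are illustrative, not taken from the authors' code):

```python
import numpy as np

def maxout_layer(x, W, b):
    """Maxout hidden layer: h_i(x) = max_j (x^T W[:, i, j] + b[i, j]).

    x : (d,)       input vector
    W : (d, m, k)  weights -- m maxout units, each with k affine pieces
    b : (m, k)     biases
    returns (m,)   activations
    """
    z = np.einsum('d,dmk->mk', x, W) + b  # z[i, j] = affine piece j of unit i
    return z.max(axis=1)                  # each unit outputs its largest piece

# toy usage with random parameters
rng = np.random.default_rng(0)
d, m, k = 5, 3, 4
h = maxout_layer(rng.normal(size=d), rng.normal(size=(d, m, k)), rng.normal(size=(m, k)))
print(h.shape)  # (3,)
```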

Theoretical Foundations

The authors prove that maxout networks are universal approximators: a maxout network with just two maxout hidden units feeding a linear output can approximate any continuous function on a compact domain arbitrarily well, provided each unit is allowed sufficiently many affine pieces. This theoretical result underpins the versatility of maxout units in representing complex functions.
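The intuition can be checked numerically. The sketch below (our own construction, not the paper's proof) approximates the convex function $x^2$ on $[-1, 1]$ with a single scalar-input maxout unit whose $k$ affine pieces are tangent lines; the universal-approximation argument then writes an arbitrary continuous function as the difference of two such convex piecewise-linear approximations, which is exactly what two maxout units feeding a linear output can represent.

```python
import numpy as np

# One maxout unit with scalar input: max_j (w[j] * x + b[j]).
# Choosing the pieces as tangent lines of f(x) = x**2 gives a convex,
# piecewise linear approximation whose error shrinks as k grows.
k = 8
anchors = np.linspace(-1.0, 1.0, k)   # where the tangents are taken (illustrative choice)
w = 2.0 * anchors                     # f'(anchor)
b = anchors**2 - w * anchors          # tangent-line intercepts

xs = np.linspace(-1.0, 1.0, 201)
maxout_fit = np.max(np.outer(xs, w) + b, axis=1)
print(np.abs(maxout_fit - xs**2).max())  # worst-case gap; decreases as k increases
```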

Empirical Performance

Maxout networks, combined with dropout, achieve state-of-the-art performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN. The primary numerical results of interest are:

  • MNIST: Achieved a test error rate of 0.45%, surpassing previous results without unsupervised pretraining.
  • CIFAR-10: With data augmentation, achieved an error rate of 9.38%, an improvement of roughly two percentage points over the prior state of the art.
  • CIFAR-100: Test error of 38.57%, setting a new record.
  • SVHN: Achieved a test error rate of 2.47%, again establishing a new benchmark.

Optimization Benefits

Maxout networks exhibit superior optimization properties, particularly when trained with dropout, due to their inherent piecewise linearity:

  • Gradient Flow: Because each maxout unit is locally linear, the gradient always flows through the currently active affine piece, even in deep architectures, avoiding the saturation problems of sigmoidal activations.
  • Training Dynamics: Dropout training benefits from unusually large weight updates. Maxout activations are unbounded and never saturate to a constant, so units are unlikely to get stuck, whereas rectified linear units can be driven into a dead state (stuck at zero activation) from which no gradient flows (see the sketch after this list).
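A tiny sketch of the gradient-flow point above (the two affine pieces and the test point are arbitrary illustrative values): at an input where a rectified linear unit is dead, its local gradient is zero, while a maxout unit still passes back the slope of whichever piece currently wins the max.

```python
import numpy as np

def relu_grad(x):
    """d/dx max(0, x): identically zero over the 'dead' region x < 0."""
    return float(x > 0)

def maxout_grad(x, w, b):
    """d/dx max_j (w[j] * x + b[j]): the slope of the active piece, so a
    gradient flows as long as the winning piece has a nonzero weight."""
    j = np.argmax(w * x + b)
    return w[j]

x = -2.0                      # a point where the ReLU unit is dead
w = np.array([1.0, -0.5])     # two affine pieces (illustrative values)
b = np.array([0.0, 0.3])
print(relu_grad(x))           # 0.0  -- no signal propagates back
print(maxout_grad(x, w, b))   # -0.5 -- the active piece still carries a gradient
```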

Model Averaging with Dropout

The paper analyzes how well dropout's inexpensive test-time inference approximates true model averaging. Empirical results suggest that, in maxout networks, the predictive distribution obtained by the weight-scaling rule closely tracks the geometric mean over the exponentially many sub-models trained by dropout, supporting the claim that maxout makes dropout's fast approximate averaging more accurate.
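For a single softmax layer, the weight-scaling rule recovers the renormalized geometric mean over all dropout masks exactly; the paper's contribution is evidence that the approximation remains close in deep maxout networks. The toy comparison below (our own single-layer illustration, not the paper's experiment) checks the single-layer case by Monte Carlo sampling of masks:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

d, n_classes, p = 20, 5, 0.5
h = rng.normal(size=d)                 # hidden activations subject to dropout
W = rng.normal(size=(d, n_classes))

# (a) Fast test-time approximation: scale the weights by (1 - p).
fast = softmax(h @ (W * (1 - p)))

# (b) Explicit averaging: geometric mean of softmax outputs over sampled masks.
n_samples = 20000
log_probs = np.zeros(n_classes)
for _ in range(n_samples):
    mask = rng.random(d) >= p
    log_probs += np.log(softmax((h * mask) @ W))
geo = np.exp(log_probs / n_samples)
geo /= geo.sum()

print(np.abs(fast - geo).max())  # small gap, shrinking as more masks are sampled
```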

Future Directions

The paper opens avenues for designing models tailored to enhance dropout's approximation capability further. Future research could explore different architectures incorporating maxout units or other innovative activation functions that exploit dropout more effectively.

Conclusion

The maxout network demonstrates how tailored activation functions can significantly enhance training methodologies like dropout. By proving theoretical robustness and showing empirical superiority across multiple benchmarks, this work substantially advances the field's understanding of model design and regularization in deep learning networks. The improvements achieved by maxout networks motivate further exploration into synergistic model-regularization pairings in artificial intelligence research.
