Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
(1506.02142v6)
Published 6 Jun 2015 in stat.ML and cs.LG
Abstract: Deep learning tools have gained tremendous attention in applied machine learning. However, such tools for regression and classification do not capture model uncertainty. In comparison, Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with a prohibitive computational cost. In this paper we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. A direct result of this theory gives us tools to model uncertainty with dropout NNs -- extracting information from existing models that has been thrown away so far. This mitigates the problem of representing uncertainty in deep learning without sacrificing either computational complexity or test accuracy. We perform an extensive study of the properties of dropout's uncertainty. Various network architectures and non-linearities are assessed on tasks of regression and classification, using MNIST as an example. We show a considerable improvement in predictive log-likelihood and RMSE compared to existing state-of-the-art methods, and finish by using dropout's uncertainty in deep reinforcement learning.
The paper demonstrates that dropout training approximates Bayesian inference by linking dropout with deep Gaussian processes through KL divergence minimization.
It details how Monte Carlo dropout sampling yields accurate estimates of predictive means and variances, improving RMSE and log-likelihood metrics over traditional methods.
The study highlights the practical benefits of modeled uncertainty in regression, classification (e.g., MNIST), and deep reinforcement learning via techniques like Thompson sampling.
The paper introduces a theoretical framework that interprets dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes (GPs). The authors posit that this perspective enables the modeling of uncertainty using dropout NNs without compromising computational complexity or test accuracy. They conduct an extensive study of the properties of dropout's uncertainty, assessing various network architectures and non-linearities on regression and classification tasks, exemplified by MNIST. The results demonstrate a considerable improvement in predictive log-likelihood and Root Mean Squared Error (RMSE) compared to existing state-of-the-art methods. Furthermore, the paper showcases the utility of dropout's uncertainty in deep reinforcement learning (RL).
The paper addresses the limitation of standard deep learning tools in capturing model uncertainty, contrasting them with Bayesian models that offer a mathematically grounded framework for reasoning about uncertainty but often at a prohibitive computational cost. The authors observe that predictive probabilities from softmax outputs in classification are often misinterpreted as model confidence, leading to unjustified high confidence in extrapolations far from the training data. They argue that model uncertainty is crucial for deep learning practitioners to handle uncertain inputs and special cases, such as deciding when to pass an input to a human for classification or enabling agents in RL to balance exploration and exploitation using uncertainty estimates over Q-value functions.
The authors show that dropout, a technique commonly used to avoid overfitting, can be interpreted as approximately integrating over the model's weights, thus providing a Bayesian approximation of a GP. This interpretation allows model uncertainty to be extracted from existing dropout NNs without altering the models or the optimization process.
The paper provides a complete theoretical treatment of the link between GPs and dropout and develops the tools necessary to represent uncertainty in deep learning. The key contributions and findings can be summarized as follows:
The paper shows that a NN with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep GP, marginalized over its covariance function parameters. The dropout objective minimizes the Kullback-Leibler (KL) divergence between an approximate distribution and the posterior of a deep GP, marginalized over its finite rank covariance function parameters.
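For reference, the standard dropout objective that the paper identifies with this KL minimization can be written, in the paper's notation (with per-point loss $E$, e.g. softmax or Euclidean loss, weight decay $\lambda$, and $N$ data points), as

$$\mathcal{L}_{\text{dropout}} = \frac{1}{N}\sum_{n=1}^{N} E\bigl(\mathbf{y}_n, \widehat{\mathbf{y}}_n\bigr) + \lambda \sum_{i=1}^{L} \left( \|\mathbf{W}_i\|_2^2 + \|\mathbf{b}_i\|_2^2 \right),$$

so that, up to additive constants, minimizing the familiar dropout training objective also minimizes the KL divergence between the approximate posterior $q(\boldsymbol{\omega})$ and the deep GP posterior.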
The authors derive results showing that model uncertainty can be obtained from dropout NN models. The approximate predictive distribution is given by $q(\mathbf{y}^* \mid \mathbf{x}^*) = \int p(\mathbf{y}^* \mid \mathbf{x}^*, \boldsymbol{\omega})\, q(\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}$, where $\boldsymbol{\omega} = \{\mathbf{W}_i\}_{i=1}^{L}$ is the set of random weight matrices for a model with $L$ layers.
The authors use moment-matching and estimate the first two moments of the predictive distribution empirically. They sample $T$ sets of Bernoulli realisation vectors $\{\mathbf{z}_1^t, \ldots, \mathbf{z}_L^t\}_{t=1}^{T}$ with $\mathbf{z}_i^t = [z_{i,j}^t]_{j=1}^{K_i}$, where $K_i$ is the width of layer $i$, and estimate the predictive mean as $\mathbb{E}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*) \approx \frac{1}{T}\sum_{t=1}^{T} \widehat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t)$.
The predictive variance is estimated as $\operatorname{Var}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*) \approx \tau^{-1}\mathbf{I}_D + \frac{1}{T}\sum_{t=1}^{T} \widehat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t)^{\top} \widehat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t) - \mathbb{E}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*)^{\top}\, \mathbb{E}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*)$, where the model precision $\tau$ is found from the identity $\tau = \frac{p\, l^2}{2N\lambda}$, with dropout probability $p$, weight decay $\lambda$, prior length-scale $l$, and $N$ data points.
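A minimal sketch of these two estimators in PyTorch follows; the regressor architecture, hyperparameter values, and function names are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class DropoutRegressor(nn.Module):
    """Small ReLU regressor with dropout before each weight layer."""
    def __init__(self, d_in, d_hidden, d_out, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p), nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Dropout(p), nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, T=100, tau=1.0):
    """Estimate the predictive mean and variance from T stochastic passes."""
    model.train()  # keep dropout sampling active at test time (MC dropout)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(T)])  # (T, N, D)
    mean = samples.mean(dim=0)
    # Diagonal of the moment-matched variance: tau^{-1} I_D plus the
    # empirical variance of the stochastic forward passes.
    var = 1.0 / tau + samples.var(dim=0, unbiased=False)
    return mean, var

# The model precision tau comes from tau = p * l**2 / (2 * N * lambda_);
# the values below are placeholders, not the paper's settings.
p, l, N, lambda_ = 0.1, 1e-2, 1000, 1e-6
tau = p * l**2 / (2 * N * lambda_)
```

Calling `model.train()` before prediction is what turns standard dropout into MC dropout here: each forward pass samples a fresh set of Bernoulli masks, i.e. a fresh draw of the weights.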
The predictive log-likelihood is estimated by Monte Carlo integration. For regression it is given by $\log p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) \approx \operatorname{logsumexp}\bigl(-\tfrac{1}{2}\tau\, \|\mathbf{y} - \widehat{\mathbf{y}}_t\|^2\bigr) - \log T - \tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log \tau^{-1}$, with a log-sum-exp over the $T$ terms and $\widehat{\mathbf{y}}_t$ the stochastic forward passes through the network.
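This estimate can be computed in a numerically stable way with a built-in log-sum-exp, as in the following sketch (assuming `samples` of shape `(T, N, D)` from the previous example and targets `y` of shape `(N, D)`; names are illustrative):

```python
import math
import torch

def predictive_log_likelihood(samples, y, tau):
    """MC estimate of log p(y*|x*): logsumexp_t(-tau/2 ||y - y_t||^2)
    - log T - (1/2) log 2*pi + (1/2) log tau, averaged over test points."""
    T = samples.shape[0]
    sq_err = ((y.unsqueeze(0) - samples) ** 2).sum(dim=-1)  # (T, N)
    ll = torch.logsumexp(-0.5 * tau * sq_err, dim=0)        # stable sum over T
    ll += -math.log(T) - 0.5 * math.log(2.0 * math.pi) + 0.5 * math.log(tau)
    return ll.mean()
```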
The authors conduct experiments to assess the properties of the uncertainty estimates obtained from dropout NNs and convnets on regression and classification tasks. They compare the uncertainty obtained from different model architectures and non-linearities on extrapolation tasks and demonstrate the importance of model uncertainty for classification tasks using MNIST.
On regression tasks, the authors train several models on the CO2 dataset, including NNs with 4 or 5 hidden layers of 1024 hidden units, using either ReLU or TanH non-linearities and dropout probabilities of 0.1 or 0.2. They compare the results with those of a GP with a squared exponential covariance function. Standard dropout NNs predict values with high confidence even where the predictions are not sensible, whereas MC dropout increases the predictive uncertainty in such regions, expressing the model's uncertainty about those points.
On classification tasks, the authors test a convolutional neural network trained on the full MNIST dataset. They train the LeNet model with dropout applied before the last fully connected inner-product layer and evaluate it on a continuously rotated image of the digit 1. The model predicts classes with high confidence when the uncertainty envelope of a class is far from those of the other classes; when the envelopes intersect, the softmax output uncertainty can be as large as the entire space.
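A sketch of how such an uncertainty envelope can be obtained, assuming `convnet` is any trained classifier with dropout layers (the function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def mc_softmax_samples(convnet, x, T=100):
    """Collect T stochastic softmax outputs; their per-class spread
    (e.g. quantiles across the T samples) forms the uncertainty envelope."""
    convnet.train()  # keep dropout active at test time
    with torch.no_grad():
        probs = torch.stack([F.softmax(convnet(x), dim=-1) for _ in range(T)])
    return probs  # shape (T, N, num_classes)
```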
The authors compare the RMSE and predictive log-likelihood of dropout to those of Probabilistic Back-propagation (PBP) and of a variational inference technique in Bayesian NNs. Dropout significantly outperforms the other models in terms of RMSE and test log-likelihood on all datasets apart from Yacht, for which PBP obtains a better RMSE.
Finally, the authors demonstrate the use of model uncertainty in a Bayesian pipeline, giving a quantitative assessment of the model's performance in a reinforcement learning setting, on a task similar to those used in deep reinforcement learning. They train the original model and an additional model with dropout probability 0.1 applied before every weight layer. To make use of the dropout Q-network's uncertainty estimates, they use Thompson sampling instead of epsilon-greedy exploration. Thompson sampling converges faster than epsilon-greedy while avoiding over-fitting: it achieves a reward larger than 1 within 25 batches of burn-in, whereas epsilon-greedy takes 175 batches to reach the same performance.
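A hedged sketch of the Thompson-sampling action selection, where `q_network` is assumed to be a Q-network with dropout layers (the environment interface and names are assumptions, not the paper's code):

```python
import torch

def thompson_action(q_network, state):
    """One stochastic forward pass = one sample of the weights;
    acting greedily on that sample implements Thompson sampling."""
    q_network.train()  # dropout stays on, sampling one Q-function
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))  # (1, num_actions)
    return int(q_values.argmax(dim=-1).item())
```

By contrast, epsilon-greedy exploration ignores the uncertainty estimates entirely, exploring uniformly at random with probability epsilon, which is consistent with its slower convergence in the paper's experiment.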