Do Deep Generative Models Know What They Don't Know? (1810.09136v3)

Published 22 Oct 2018 in stat.ML and cs.LG

Abstract: A neural network deployed in the wild may be asked to make predictions for inputs that were drawn from a different distribution than that of the training data. A plethora of work has demonstrated that it is easy to find or synthesize inputs for which a neural network is highly confident yet wrong. Generative models are widely viewed to be robust to such mistaken confidence as modeling the density of the input features can be used to detect novel, out-of-distribution inputs. In this paper we challenge this assumption. We find that the density learned by flow-based models, VAEs, and PixelCNNs cannot distinguish images of common objects such as dogs, trucks, and horses (i.e. CIFAR-10) from those of house numbers (i.e. SVHN), assigning a higher likelihood to the latter when the model is trained on the former. Moreover, we find evidence of this phenomenon when pairing several popular image data sets: FashionMNIST vs MNIST, CelebA vs SVHN, ImageNet vs CIFAR-10 / CIFAR-100 / SVHN. To investigate this curious behavior, we focus analysis on flow-based generative models in particular since they are trained and evaluated via the exact marginal likelihood. We find such behavior persists even when we restrict the flows to constant-volume transformations. These transformations admit some theoretical analysis, and we show that the difference in likelihoods can be explained by the location and variances of the data and the model curvature. Our results caution against using the density estimates from deep generative models to identify inputs similar to the training distribution until their behavior for out-of-distribution inputs is better understood.

Citations (720)

Summary

  • The paper demonstrates that deep generative models fail at out-of-distribution (OOD) detection, and analyzes this failure via a second-order expansion of the log-likelihood.
  • It shows that differences in data covariance, analyzed for CV-GLOW on the CIFAR-10/SVHN pair, explain the misleadingly high likelihoods assigned to OOD inputs.
  • The findings underline critical practical and theoretical limitations, motivating future research on hybrid models for improved OOD robustness.

Do Deep Generative Models Know What They Don't Know?

The paper "Do Deep Generative Models Know What They Don't Know?" by Nalisnick, Matsukawa, Teh, Gorur, and Lakshminarayanan investigates an important aspect of deep generative models: their ability to discern when they are failing, specifically in the context of out-of-distribution (OOD) detection.

Overview and Core Contribution

The core contribution of the paper centers around evaluating the failure modes of deep generative models, notably Variational Autoencoders (VAEs) and Flow-based models, in recognizing OOD data. The paper is motivated by the observation that generative models, despite their capacity to learn complex distributions, may still assign high likelihoods to data points from a completely different distribution than the one they were trained on.

Methodology

The authors employ a second-order expansion of the log-likelihood function around an interior point $x_0$, providing insight into the behavior of the log-likelihood under perturbations. This expansion allows an approximation of the log-likelihood differences between in-distribution and OOD data. Formally, the expansion is given by:

$$\log p(x) \approx \log p(x_0) + \nabla_{x_0} \log p(x_0)^T (x - x_0) + \frac{1}{2} \text{Tr}\left\{ \nabla_{x_0}^2 \log p(x_0) \, (x - x_0)(x - x_0)^T \right\}$$
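To make the expansion concrete, here is a minimal sketch (not from the paper) that evaluates it for a toy 2-D Gaussian, whose log-density is exactly quadratic so the second-order approximation is exact; PyTorch's autograd supplies the gradient and Hessian, and the trace term is computed via the identity $\text{Tr}\{H (x - x_0)(x - x_0)^T\} = (x - x_0)^T H (x - x_0)$.

```python
# Minimal sketch: second-order expansion of a log-density around x0.
# Toy Gaussian only; the paper applies this analysis to deep flow-based models.
import torch

def log_p(x):
    # Log-density of a zero-mean Gaussian with diagonal covariance diag(1, 4).
    var = torch.tensor([1.0, 4.0])
    return -0.5 * torch.sum(x ** 2 / var) - 0.5 * torch.sum(torch.log(2 * torch.pi * var))

x0 = torch.tensor([0.5, -1.0])                               # expansion point
grad = torch.autograd.functional.jacobian(log_p, x0)         # gradient of log p at x0
hess = torch.autograd.functional.hessian(log_p, x0)          # Hessian of log p at x0

def second_order_approx(x):
    d = x - x0
    # Tr{H d d^T} equals the quadratic form d^T H d.
    return log_p(x0) + grad @ d + 0.5 * (d @ hess @ d)

x = torch.tensor([0.7, -0.8])
print(float(log_p(x)), float(second_order_approx(x)))        # identical for a Gaussian
```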

By taking expectations of this expansion and denoting the covariance by $\Sigma$, the paper derives conditions under which the models fail to separate in-distribution from OOD data. In particular, the empirical means of CIFAR-10 and SVHN images are nearly identical, so the first-order (gradient) term contributes little and the likelihood gap is governed by the second-order covariance term.
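Concretely, taking the expectation of the expansion under the OOD distribution $q$ and under the training distribution $p$, and subtracting, gives the following when both means are close to the expansion point $x_0$ (a hedged reconstruction of this step; the paper's exact notation may differ):

$$\mathbb{E}_{q}[\log p(x)] - \mathbb{E}_{p}[\log p(x)] \approx \frac{1}{2} \text{Tr}\left\{ \nabla_{x_0}^2 \log p(x_0) \left( \Sigma_q - \Sigma_p \right) \right\}, \qquad \mu_q \approx \mu_p \approx x_0$$

The sign of the gap is therefore set by the covariance difference weighted by the model's curvature, which is exactly the quantity examined in the findings below.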

Main Findings

  1. Likelihood Gap Analysis: For the CIFAR-10/SVHN pair, the paper shows that the covariance difference $\Sigma_q - \Sigma_p$ dominates the expected log-likelihood gap. For the CV-GLOW model, the trace term $\text{Tr}\{ [\nabla^2_{x_0} \log p(x_0)] (\Sigma_q - \Sigma_p) \}$ is the decisive quantity (an empirical check of the per-channel variances behind it is sketched after this list).
  2. Failure in OOD Identification: The likelihoods assigned to OOD data (e.g., SVHN evaluated under a model trained on CIFAR-10) are not lower than those of in-distribution data; they are frequently higher, so thresholding the likelihood fails to flag OOD inputs.
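As a rough empirical check of the covariance argument in finding 1, here is a minimal sketch (assuming torchvision is installed and can download the datasets to ./data; the absolute values depend on pixel scaling and will not match the paper's numbers, only the ordering matters):

```python
# Minimal sketch: compare per-channel pixel variances of CIFAR-10 and SVHN.
# The paper's analysis predicts SVHN's variances are smaller, which is what
# lets a CIFAR-10-trained model assign SVHN a higher likelihood.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # scales pixels to [0, 1]

def per_channel_variance(dataset):
    loader = DataLoader(dataset, batch_size=1024, num_workers=2)
    count, s, s2 = 0, torch.zeros(3), torch.zeros(3)
    for x, _ in loader:                       # x: (B, 3, 32, 32)
        count += x.numel() // 3               # pixels per channel
        s += x.sum(dim=(0, 2, 3))
        s2 += (x ** 2).sum(dim=(0, 2, 3))
    mean = s / count
    return s2 / count - mean ** 2             # Var = E[x^2] - E[x]^2

cifar = datasets.CIFAR10("./data", train=True, download=True, transform=to_tensor)
svhn = datasets.SVHN("./data", split="train", download=True, transform=to_tensor)

print("CIFAR-10 per-channel variance:", per_channel_variance(cifar))
print("SVHN per-channel variance:    ", per_channel_variance(svhn))
```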

Numerical Results

A striking numerical result is that, for a CV-GLOW model trained on CIFAR-10, the per-channel (color) variances of SVHN are smaller than those of CIFAR-10, which forces the expected log-likelihood of SVHN to be at least as high as that of CIFAR-10. Explicitly, the result shows:

$$\mathbb{E}_{\text{SVHN}}[\log p(x)] - \mathbb{E}_{\text{CIFAR-10}}[\log p(x)] \approx \frac{-1}{2\sigma^{2}} \left[\alpha_1(49.6 - 61.9) + \alpha_2(52.7 - 59.2) + \alpha_3(53.6 - 68.1) \right] \ge 0$$

This numerical finding reinforces the assertion about the limitations of generative models in OOD detection.
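As a quick arithmetic sanity check of that inequality (a minimal sketch; the channel weights $\alpha_k$ and $\sigma^2$ below are placeholder values, since the sign only requires them to be nonnegative and positive, respectively):

```python
# Sanity-check the sign of the bracketed expression: every per-channel
# difference is negative, so multiplying by -1/(2*sigma^2) with sigma^2 > 0
# and nonnegative weights alpha_k yields a nonnegative likelihood gap.
diffs = [49.6 - 61.9, 52.7 - 59.2, 53.6 - 68.1]                       # all negative
for alphas in [(1.0, 1.0, 1.0), (0.3, 0.5, 0.2), (2.0, 0.0, 1.0)]:    # placeholder weights
    sigma2 = 1.0                                                      # placeholder variance
    gap = (-1.0 / (2.0 * sigma2)) * sum(a * d for a, d in zip(alphas, diffs))
    assert gap >= 0.0
    print(f"alphas={alphas}: approximate gap = {gap:.2f} >= 0")
```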

Implications and Future Directions

Practical Implications

The practical implication of this research is significant for deploying generative models in real-world applications where robustness to OOD inputs is crucial. Anomaly detection, surveillance, and other open-world deployments could be severely impacted if models fail to recognize unfamiliar data.

Theoretical Implications

From a theoretical perspective, this paper raises essential questions about the fundamental limitations of likelihood-based generative models and the need for more advanced, potentially hybrid approaches (combining generative models with discriminative ensembles) to achieve robust OOD detection.

Future Directions

Potential future research directions inspired by this paper include:

  1. Development of Novel OOD Detection Algorithms: Improved methodologies leveraging both generative and discriminative properties to enhance OOD detection capabilities.
  2. Hybrid Model Architectures: Exploration of hybrid architectures that could inherently account for OOD uncertainty.
  3. Evaluation on Diverse Datasets: Extending the evaluation framework to a wider variety of datasets and model architectures to generalize findings and recommendations.

In summary, the paper "Do Deep Generative Models Know What They Don't Know?" provides critical insights into the limitations of current deep generative models concerning OOD detection and opens avenues for further research efforts aimed at enhancing the reliability and applicability of these models in diverse settings.
