Probabilistic Deep Learning using Random Sum-Product Networks (1806.01910v2)

Published 5 Jun 2018 in cs.LG, cs.AI, and stat.ML

Abstract: The need for consistent treatment of uncertainty has recently triggered increased interest in probabilistic deep learning methods. However, most current approaches have severe limitations when it comes to inference, since many of these models do not even permit to evaluate exact data likelihoods. Sum-product networks (SPNs), on the other hand, are an excellent architecture in that regard, as they allow to efficiently evaluate likelihoods, as well as arbitrary marginalization and conditioning tasks. Nevertheless, SPNs have not been fully explored as serious deep learning models, likely due to their special structural requirements, which complicate learning. In this paper, we make a drastic simplification and use random SPN structures which are trained in a "classical deep learning manner", i.e. employing automatic differentiation, SGD, and GPU support. The resulting models, called RAT-SPNs, yield prediction results comparable to deep neural networks, while still being interpretable as generative model and maintaining well-calibrated uncertainties. This property makes them highly robust under missing input features and enables them to naturally detect outliers and peculiar samples.

Citations (28)

Summary

  • The paper introduces a novel deep architecture using fixed random region graphs to construct Sum-Product Networks, enabling end-to-end training with SGD.
  • It demonstrates that RAT-SPNs deliver tractable likelihood computations and effective probabilistic dropout for handling missing features.
  • Experimental results show that RAT-SPNs achieve competitive classification accuracy while enhancing anomaly and out-of-domain detection.

This paper introduces Random Tensorized Sum-Product Networks (RAT-SPNs), a method designed to bridge the gap between deep learning and probabilistic modeling by leveraging the inference capabilities of Sum-Product Networks (SPNs) within a standard deep learning framework (1806.01910). While traditional deep learning models like VAEs or GANs often struggle with exact inference (e.g., likelihood evaluation), SPNs provide tractable computation of likelihoods and arbitrary marginals. However, SPN adoption has been limited due to complex structure learning requirements and non-standard parameter learning techniques. RAT-SPNs address this by using fixed, randomly generated structures that can be trained end-to-end using automatic differentiation and SGD, similar to conventional neural networks.

Constructing RAT-SPNs

The construction involves two main steps:

  1. Random Region Graph Generation: A region graph defines the hierarchical partitioning of the input variables (scope). Algorithm 2 outlines a procedure to create this graph randomly. Starting with the full set of variables $V$ as the root region, the scope is recursively split into two roughly balanced sub-regions ($|\mathcal{R}_1| \approx |\mathcal{R}_2|$) down to depth $D$, and this splitting is repeated $R$ times from the root, creating multiple pathways for partitioning the variables (a Python sketch follows the pseudocode below).

Algorithm 2: Random Region Graph
Input: Variables V, Depth D, Repetitions R
1. Initialize empty region graph G
2. Add V to G as the root region
3. For r = 1 to R:
4.   Split(G, V, D)
5. Return G

Procedure Split(G, Region R_current, depth d):
1. Draw balanced partition P = {R_1, R_2} of R_current
2. Add R_1, R_2 to G
3. Add P to G as child of R_current, parent of R_1, R_2
4. If d > 1:
5.   If |R_1| > 1: Split(G, R_1, d-1)
6.   If |R_2| > 1: Split(G, R_2, d-1)
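
For concreteness, here is a minimal Python sketch of this procedure (not the authors' code); representing regions as frozensets of variable indices and partitions as (parent, child, child) triples is an assumption made for illustration.

import random

def random_region_graph(num_vars, depth, repetitions, seed=0):
    """Randomly build a region graph in the spirit of Algorithm 2 (illustrative sketch)."""
    rng = random.Random(seed)
    root = frozenset(range(num_vars))
    regions, partitions = {root}, []

    def split(region, d):
        # Draw a balanced random partition {R_1, R_2} of the current region.
        variables = list(region)
        rng.shuffle(variables)
        mid = len(variables) // 2
        r1, r2 = frozenset(variables[:mid]), frozenset(variables[mid:])
        regions.update([r1, r2])
        partitions.append((region, r1, r2))
        # Recurse while depth remains and the sub-regions can still be split.
        if d > 1:
            if len(r1) > 1:
                split(r1, d - 1)
            if len(r2) > 1:
                split(r2, d - 1)

    for _ in range(repetitions):  # R independent splitting passes from the root
        split(root, depth)
    return regions, partitions

# Example: 8 variables, depth 2, 2 repetitions.
regions, partitions = random_region_graph(num_vars=8, depth=2, repetitions=2)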

  2. SPN Generation from Region Graph: Algorithm 1 populates the region graph structure with SPN nodes (sums, products, input distributions) organized into tensors.
    • Leaf Regions: Equipped with $I$ input distribution nodes (e.g., Gaussians).
    • Root Region: Equipped with $C$ sum nodes (one for each class in classification). For density estimation, $C=1$.
    • Inner Regions: Equipped with $S$ sum nodes.
    • Partitions $\{\mathcal{R}_1, \mathcal{R}_2\}$: For every pair of nodes $(N_1, N_2)$ where $N_1$ belongs to region $\mathcal{R}_1$ and $N_2$ to region $\mathcal{R}_2$, a product node $P = N_1 \times N_2$ is created. These product nodes become children of all sum nodes in the parent region $\mathcal{R}_1 \cup \mathcal{R}_2$.

Algorithm 1: Construct SPN from Region Graph
Input: Region Graph G, Classes C, Sums per region S, Inputs per leaf I
1. Initialize empty SPN
2. For each region R in G:
3.   If R is a leaf region: Add I distribution nodes for R
4.   Else if R is the root region: Add C sum nodes for R
5.   Else: Add S sum nodes for R
6. For each partition P = {R_1, R_2} in G:
7.   Let Nodes_R1, Nodes_R2 be the nodes associated with R_1 and R_2
8.   For N1 in Nodes_R1, N2 in Nodes_R2:
9.     Create product node Prod = N1 * N2
10.    Make Prod a child of each node N in Nodes_(R1 U R2)
11. Return SPN

This process yields a deep architecture where nodes within a region can be processed as tensors, making it suitable for GPU acceleration using frameworks like TensorFlow. The structure guarantees the completeness and decomposability properties required for efficient SPN inference.
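
As detailed in the implementation notes below, these tensorized computations are carried out in the log domain. The following NumPy sketch (an illustration under assumed shapes and names, not the paper's TensorFlow code) shows how one partition's product nodes and the parent region's sum nodes can be evaluated: a product of densities becomes a sum of log-values over all pairs of child nodes, and sum nodes apply a log-sum-exp over log-softmax-normalized weights.

import numpy as np

def product_layer(log_r1, log_r2):
    """All pairwise products of nodes from two child regions.
    log_r1: (batch, n1) log-values of region R_1's nodes.
    log_r2: (batch, n2) log-values of region R_2's nodes.
    Returns (batch, n1 * n2) log-values; products of densities are sums of logs."""
    batch = log_r1.shape[0]
    return (log_r1[:, :, None] + log_r2[:, None, :]).reshape(batch, -1)

def sum_layer(log_children, weight_logits):
    """S sum nodes over the K product nodes of a region.
    log_children: (batch, K) log-values of the child product nodes.
    weight_logits: (S, K) unconstrained parameters; log-softmax keeps each sum
    node's mixture weights non-negative and normalized.
    Returns (batch, S) log-values via a numerically stable log-sum-exp."""
    log_w = weight_logits - np.logaddexp.reduce(weight_logits, axis=1, keepdims=True)
    scores = log_children[:, None, :] + log_w[None, :, :]  # (batch, S, K)
    return np.logaddexp.reduce(scores, axis=2)

# Toy usage: 4 nodes per child region -> 16 product nodes -> 3 sum nodes.
rng = np.random.default_rng(0)
log_r1, log_r2 = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
out = sum_layer(product_layer(log_r1, log_r2), rng.normal(size=(3, 16)))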

Implementation and Training

  • Framework: Implemented in TensorFlow, representing nodes in each region as matrices (batch size $\times$ number of nodes).
  • Log-Domain: Computations are performed in the log domain for numerical stability (using log-sum-exp for sums, simple summation for products).
  • Parameters: Sum weights are parameterized using log-softmax to ensure non-negativity and normalization. Input distributions used were isotropic Gaussians; means were learned, but fixing variances to 1 yielded better results than learning them.
  • Optimization: Trained using Adam optimizer with default settings and mini-batches.
  • Objective Function: A hybrid objective is used, combining cross-entropy ($\mathsf{CE}$) for discriminative performance and normalized negative log-likelihood ($\mathsf{nLL}$) for generative modeling:

    $\mathsf{O}(\bm{w}) = \lambda \, \mathsf{CE}(\bm{w}) + (1 - \lambda) \, \mathsf{nLL}(\bm{w})$

    The parameter $\lambda \in [0, 1]$ controls the trade-off: $\lambda=1$ is purely discriminative, $\lambda=0$ is purely generative (maximum likelihood). A minimal sketch of this objective appears after this list.

  • Computational Cost: RAT-SPNs are noted to be about an order of magnitude slower than comparable ReLU MLPs due to the complexity of tensor operations and log-domain computations in the current implementation, though optimization is possible.
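
As a concrete illustration of the hybrid objective above, here is a minimal NumPy sketch (not the authors' TensorFlow implementation); the per-feature normalization of nLL and the argument names are assumptions made for illustration.

import numpy as np

def hybrid_objective(log_joint, labels, lam, num_features):
    """log_joint: (batch, C) values log p(x, y=c) from the C root sum nodes.
    labels: (batch,) integer class labels.
    lam: trade-off in [0, 1]; 1 is purely discriminative, 0 purely generative.
    num_features: used here for a plausible per-feature normalization of the
    log-likelihood; the paper's exact normalization of nLL may differ."""
    batch = log_joint.shape[0]
    log_px = np.logaddexp.reduce(log_joint, axis=1)            # log p(x) = log sum_c p(x, c)
    log_py_given_x = log_joint[np.arange(batch), labels] - log_px
    ce = -log_py_given_x.mean()                                # cross-entropy
    nll = -log_px.mean() / num_features                        # normalized neg. log-likelihood
    return lam * ce + (1.0 - lam) * nll

# Toy usage with random log-joints for a 10-class problem on 784 features.
rng = np.random.default_rng(0)
loss = hybrid_objective(rng.normal(size=(32, 10)), rng.integers(0, 10, size=32),
                        lam=0.5, num_features=784)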

Probabilistic Dropout

To prevent overfitting in large RAT-SPNs, a probabilistic version of dropout is introduced:

  • Input Dropout: Interpreted as marginalizing out missing features. Instead of setting inputs to zero and rescaling, the corresponding input log-distribution nodes in the SPN are set to 0 (effectively making their contribution 1, representing marginalization). This drops entire blocks of units corresponding to a feature.
  • Sum Node Dropout: Leveraging the latent variable interpretation of SPNs, dropout at sum nodes is implemented by injecting discrete noise. For a region with $K$ children feeding into its sum nodes, dropout randomly selects a subset of the incoming product nodes (children of the sums) and sets their log-values to $-\infty$, effectively pruning parts of the SPN structure during training. This is akin to sampling sub-structures, similar in spirit to standard dropout.
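
The following is a minimal, assumed sketch of both dropout variants operating on log-domain node values; shapes and names are illustrative, and the corner case in which every child of a sum node is dropped is ignored.

import numpy as np

rng = np.random.default_rng(0)

def input_dropout(leaf_log_probs, keep_prob):
    """leaf_log_probs: (batch, num_features, I) log-densities of the leaf nodes.
    Dropping a feature sets its whole block of leaf log-densities to 0, i.e. the
    feature is marginalized out rather than zeroed and rescaled."""
    keep = rng.random(leaf_log_probs.shape[:2]) < keep_prob    # per-feature mask
    return np.where(keep[..., None], leaf_log_probs, 0.0)

def sum_node_dropout(child_log_values, keep_prob):
    """child_log_values: (batch, K) log-values of the product nodes feeding a
    region's sum nodes. Dropped children get log-value -inf, so they contribute
    probability 0 to the mixture for this training step."""
    keep = rng.random(child_log_values.shape) < keep_prob
    return np.where(keep, child_log_values, -np.inf)

# Toy usage on random log-values.
dropped = sum_node_dropout(rng.normal(size=(32, 16)), keep_prob=0.75)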

Applications and Experimental Results

  1. Capacity: Experiments on MNIST show RAT-SPNs can scale to millions of parameters and achieve training accuracies comparable to MLPs, indicating similar representational capacity.
  2. Generalization: On classification tasks (MNIST, Fashion-MNIST, 20 Newsgroups), RAT-SPNs with probabilistic dropout achieve performance comparable to MLPs (both standard versions and vanilla versions without batch normalization or Xavier initialization).
  3. Hybrid Modeling: By adjusting $\lambda$ post-training, a trade-off between discriminative accuracy and generative log-likelihood can be achieved. Small reductions in accuracy can lead to significant improvements in generative performance (input likelihood).
  4. Robustness to Missing Features: RAT-SPNs, especially those trained with $\lambda < 1$, demonstrate significantly higher robustness to missing input features compared to MLPs. Missing features are handled by exact marginalization in the SPN. For instance, a model with $\lambda=0.2$ maintained high accuracy even when >60% of features were missing, drastically outperforming MLPs.
  5. Outlier and Out-of-Domain Detection: The generative nature ($\lambda < 1$) allows RAT-SPNs to assign likelihoods to inputs.
    • Qualitative: Low-likelihood samples on MNIST and Fashion-MNIST correspond to visually peculiar, ambiguous, or low-quality images.
    • Quantitative (Transfer Testing): When testing an MNIST-trained RAT-SPN ($\lambda=0.2$) on SVHN and SEMEION datasets, the input log-likelihoods clearly distinguish between in-domain (MNIST) and out-of-domain samples. This provides a reliable signal that the model "knows what it doesn't know." In contrast, a similar calculation using MLP outputs ("mock-likelihood") does not provide clear separation.
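
The following schematic sketch shows how the last two points could look at test time; the array shapes and the use of a training-set percentile as an outlier threshold are illustrative assumptions, not the paper's exact protocol.

import numpy as np

def marginalize_missing(leaf_log_probs, observed_mask):
    """Exact marginalization of missing features: the leaf log-densities of
    unobserved features are set to 0 (probability 1) before the upward pass,
    so the network evaluates log p(x_observed).
    leaf_log_probs: (batch, num_features, I); observed_mask: (batch, num_features) bool."""
    return np.where(observed_mask[..., None], leaf_log_probs, 0.0)

def flag_out_of_domain(log_px, train_log_px, percentile=1.0):
    """Flag inputs whose log-likelihood log p(x) falls below a low percentile
    of the log-likelihoods observed on the training data."""
    threshold = np.percentile(train_log_px, percentile)
    return log_px < threshold

# Toy usage with random numbers standing in for model log-likelihoods.
rng = np.random.default_rng(0)
flags = flag_out_of_domain(rng.normal(-700, 50, size=10), rng.normal(-650, 30, size=1000))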

Practical Implications

  • RAT-SPNs offer a way to build deep probabilistic models that retain tractable inference (likelihoods, marginals, conditionals).
  • The random structure simplifies implementation significantly compared to traditional SPN structure learning.
  • They can be implemented and trained using standard deep learning tools (TensorFlow, PyTorch, SGD, GPUs).
  • The hybrid objective allows tuning the model for specific needs, balancing classification accuracy with generative capabilities.
  • Probabilistic dropout provides effective regularization.
  • Key applications include scenarios requiring uncertainty quantification, robustness to missing data, and anomaly/out-of-domain detection. The ability to evaluate input likelihood is a major advantage over purely discriminative models or implicit generative models.

This approach makes SPNs more accessible for deep learning practitioners seeking models with well-calibrated uncertainty and robust inference capabilities.