Concrete Distribution for Differentiable Models
- Concrete distribution is a parameterized continuous distribution defined over the probability simplex, serving as a smooth relaxation of discrete one-hot representations.
- It employs the Gumbel-Softmax trick to generate differentiable samples, which facilitates low-variance, pathwise gradient estimation in models with discrete latent variables.
- Its application in variational autoencoders and feature selection showcases improved performance and efficient reparameterization compared to traditional discrete methods.
The Concrete distribution is a parameterized family of continuous distributions on the probability simplex that provides a differentiable relaxation of categorical (i.e., discrete one-hot) random variables. Its defining property is that it enables reparameterization-based gradient estimation for models with discrete latent structures, permitting low-variance, pathwise derivatives. The distribution is closely related to the Gumbel-Softmax, with both sharing the same sampling and relaxation mechanism. Applications include variational autoencoders with discrete latents, feature selection, and end-to-end differentiable architectures for tasks traditionally relying on non-differentiable choices (Maddison et al., 2016, Abid et al., 2019, Chow, 2022).
1. Definition and Fundamental Properties
Let be the number of categories, denote positive unnormalized logits, and the temperature. The -dimensional random vector (where is the probability simplex) is said to have the Concrete distribution with parameters , denoted (Maddison et al., 2016, Chow, 2022).
The closed-form density for is: or, equivalently,
0
where 1 (Chow, 2022). Each term in the density accounts for the necessary normalization, coordinate singularities, and coupling among coordinates.
2. Sampling and Reparameterization
Sampling from the Concrete distribution employs the Gumbel-Softmax trick:
- Draw 2 independently for each 3.
- Compute
4
Simple code uses either standard libraries or manual computation via 5, 6.
This sampling mechanism is central to the reparameterization trick, as the random variables can be produced as a differentiable function of parameters and noise. Consequently, gradients with respect to 7 or upstream parameters (e.g., neural network weights) can be computed via automatic differentiation, yielding low-variance estimators for model training objectives (Maddison et al., 2016, Abid et al., 2019).
3. Geometric and Information-Theoretic Structure
The Concrete distribution has deep geometric properties on the simplex. The density can be understood as a transformation of the uniform distribution via a Hadamard product and scaling in Aitchison geometry: 8 for 9 uniform on the simplex (Chow, 2022).
The Fisher information metric computed for the Concrete family induces a hyperbolic geometry isometric to the Poincaré half-space:
0
where new coordinates 1 for 2, 3, and 4 (Chow, 2022).
The corresponding Fisher-Rao geodesic distance between Concrete distributions (5, 6) is: 7 which has a closed-form when mapping back to the 8 parameterization.
4. Temperature Limit Behavior
The temperature parameter 9 fundamentally modulates the relaxation:
- As 0: The softmax function becomes an 1, and the Concrete distribution puts all mass on the simplex vertices, thus recovering the original categorical (discrete) distribution:
2
- As 3: The softmax flattens, and the distribution converges to the uniform distribution on the simplex.
Temperature annealing schedules are commonly used in applications to transition from smooth to nearly discrete behavior, e.g., exponential schedules 4 (Abid et al., 2019).
5. Application in Stochastic Computation Graphs and Autoencoders
Replacing discrete nodes with Concrete random variables allows end-to-end differentiable training of neural networks incorporating discrete latent choices. For a surrogate objective 5, the gradient estimator is: 6 which can be directly implemented by backpropagation through the softmax-based transformation (Maddison et al., 2016). This approach yields unbiased gradients for the relaxed problem and low-variance, albeit biased, gradients relative to the original discrete setup.
A prominent architecture leveraging the Concrete distribution is the concrete autoencoder (Abid et al., 2019). Here, a “concrete selector layer” parameterized via multiple Concrete random variables is used to select 7 features from an input vector in a fully differentiable manner. The temperature is annealed during training for progressive discretization, and at test time, the mode of each selector replaces the stochastic variable for hard selection.
6. Practical Implementation and Empirical Results
In variational autoencoder architectures for density estimation and structured prediction (e.g., on MNIST or Omniglot), replacing categorical latent variables with Concrete relaxations enables evaluating the KL term in closed form and permits efficient backpropagation. In practice, for deep, nonlinear architectures, the Concrete approach has demonstrated superior performance relative to score-function-based methods such as VIMCO, as evidenced by lower negative log-likelihoods (e.g., test NLL ≈ 89.5 for Concrete vs. ≈ 91.4 for VIMCO on two-layer nonlinear VAEs with importance weighting) (Maddison et al., 2016).
For differentiable feature selection, concrete autoencoders can efficiently select informative subsets of features and reconstruct the remainder, outperforming previous state-of-the-art selection methods on diverse datasets including large-scale gene expression matrices (Abid et al., 2019).
Key checklist items for practical deployment include Gumbel sampling, temperature annealing, and standard optimizer choices (Adam, proper initialization). The overall differentiable pipeline can be realized with a minimal amount of additional code atop standard deep learning frameworks.
7. Connections, Variants, and Theoretical Extensions
The Concrete distribution is equivalent to the Gumbel-Softmax; the key distinction in Maddison et al.’s treatment is the provision of a closed-form density and the prescription to use the Concrete density in place of discrete log-mass terms whenever objective terms involve the log-probability of latent assignments (Maddison et al., 2016). The Concrete family’s geometric and informational properties, as well as explicit closed-form expressions for probability densities and Fisher-Rao distances, make it a subject of ongoing theoretical investigation (Chow, 2022).
This distribution has catalyzed a broad array of research into continuous relaxations for discrete structures, stochastic computation graph optimization, and neural feature selection, and remains foundational in the design of modern machine learning methods for discrete latent variable modeling.