Discrete Latent Variable Reparameterization
- Discrete latent variable reparameterization comprises strategies that enable gradient-based optimization in models with discrete random components by addressing inherent non-differentiability.
- Techniques such as quantile-based transforms, continuous relaxations (e.g., Gumbel-Softmax), and unbiased control-variate estimators trade off bias and variance to make gradient-based training of deep generative and structured models tractable.
- These methods are critical for training systems like VAEs, Bayesian networks, and discrete neural architectures, offering theoretical guarantees and improved performance in probabilistic inference.
Discrete latent variable reparameterization denotes a suite of algorithmic strategies for enabling efficient and low-variance gradient-based optimization in probabilistic models where at least some latent variables are discrete. These strategies are crucial for training modern deep generative models, variational autoencoders (VAEs), Bayesian networks, probabilistic neural networks with discrete weights/activations, and structured prediction models. Discrete latent variables induce non-differentiability in the generative or inference process, impeding direct backpropagation. Diverse reparameterization schemes have been developed to address this, encompassing direct marginalization, continuous relaxations, quantile-based transforms, unbiased estimators, and advanced control variate augmentation. The following sections synthesize the technical developments, theoretical guarantees, and key application domains characterizing the state of discrete latent variable reparameterization.
1. Stochastic Gradient Estimation and the Reparameterization Challenge
Continuous latent variables allow for pathwise derivative estimators via the reparameterization trick, exemplified for Gaussians by $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, so that $\nabla_{\mu,\sigma}\,\mathbb{E}_{q}[f(z)] = \mathbb{E}_{\epsilon}\big[\nabla_{\mu,\sigma} f(\mu + \sigma\epsilon)\big]$. Discrete latent variables, in contrast, preclude such direct mappings due to the inherent discontinuity of their sampling operations (e.g. categorical selection, Bernoulli thresholding). Traditional score-function (REINFORCE) estimators are unbiased but suffer from prohibitively high variance, limiting their practicality for deep models with many discrete latents.
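As a minimal illustration, the sketch below (assuming PyTorch; the quadratic objective and constants are illustrative) contrasts the two regimes: a pathwise gradient for a Gaussian latent, where gradients flow through the sample, and a score-function gradient for a Bernoulli latent, where gradients flow only through the log-probability.

```python
import torch

torch.manual_seed(0)
N = 100_000

# Toy objective f(z) = (z - 2)^2; we estimate d/dparam E_q[f(z)].

# (1) Pathwise (reparameterized) gradient for a Gaussian q = N(mu, sigma^2):
#     z = mu + sigma * eps with eps ~ N(0, 1), so gradients flow through the sample.
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
z = mu + log_sigma.exp() * torch.randn(N)
((z - 2.0) ** 2).mean().backward()
print("pathwise  d/dmu   :", mu.grad.item())       # exact value is 2*(mu - 2) = -4

# (2) Score-function (REINFORCE) gradient for b ~ Bernoulli(sigmoid(logit)):
#     grad = E[ f(b) * d/dlogit log p(b) ]  -- unbiased, but typically much noisier.
logit = torch.tensor(0.0, requires_grad=True)
dist = torch.distributions.Bernoulli(logits=logit)
b = dist.sample((N,))                               # discrete; no pathwise gradient exists
((b - 2.0) ** 2 * dist.log_prob(b)).mean().backward()
print("REINFORCE d/dlogit:", logit.grad.item())     # exact value is -3*sigmoid'(0) = -0.75
```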
Key difficulties arise in two representative scenarios:
- Mixture density models: The assignment of samples to components is parameterized by discrete mixture weights. The standard reparameterization is inapplicable to the categorical variable controlling the component selection (Graves, 2016).
- Discrete neural weights/activations: Training with binary/ternary weights or sign activations cannot be performed using naive pathwise gradients, yet efficient inference critically depends on such representations (Shayer et al., 2017, Berger et al., 2023).
Overcoming these obstacles requires alternative treatments, each tailored to maintain unbiasedness, control estimator variance, and enable scalable learning.
2. Quantile-based and Marginalization-based Techniques
Mixture density models and models with factorized discrete latents benefit from strategies that leverage the structure of their joint densities. One class is the quantile-transform-based approach, applicable to continuous random vectors $x \in \mathbb{R}^D$ with differentiable density $p(x)$. The transformation constructs $x$ from independent uniform variables $u_1, \dots, u_D$:
- For $d = 1, \dots, D$, $x_d = F_d^{-1}(u_d \mid x_{1:d-1})$, where $F_d(\cdot \mid x_{1:d-1})$ is the conditional CDF of $x_d$ given the preceding coordinates.
- Differentiating this inverse-CDF map with respect to the distribution parameters (using the Leibniz rule) yields gradient terms expressed through the density and its parameter derivatives.
While this enables backpropagation through the continuous mixture components, the non-differentiable mixture weight assignment is addressed by an unbiased estimator derived as a sum over the Monte Carlo samples and dimensions, yielding correct gradients with respect to mixture parameters (Graves, 2016). Crucially, this sidesteps the need to differentiate through the discrete random variable itself.
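To make the quantile-transform idea concrete, here is a one-dimensional sketch (NumPy/SciPy; the mixture parameters and the bisection inversion are illustrative) that inverts a Gaussian-mixture CDF numerically and differentiates the resulting sample with respect to the component means via the implicit relation $F(x;\theta) = u$, i.e. $\partial x/\partial\theta = -(\partial F/\partial\theta)/p(x)$. The estimator of (Graves, 2016) extends this to conditional CDFs in higher dimensions and to gradients with respect to the mixture weights.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D Gaussian mixture; a sample is drawn as x = F^{-1}(u).
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])

def cdf(x):
    return np.sum(pi * norm.cdf(x, loc=mu, scale=sigma))

def pdf(x):
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

def inverse_cdf(u, lo=-20.0, hi=20.0, iters=80):
    """Invert the (monotone) mixture CDF by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
u = rng.uniform()
x = inverse_cdf(u)

# Implicit differentiation of F(x; mu_k) = u:
#   dF/dmu_k = -pi_k * N(x; mu_k, sigma_k^2)  =>  dx/dmu_k = pi_k * N(x; mu_k, sigma_k^2) / p(x)
dx_dmu = pi * norm.pdf(x, loc=mu, scale=sigma) / pdf(x)   # the component "responsibilities"
print("sample x:", x, " dx/dmu:", dx_dmu)
```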
An orthogonal marginalization-based method achieves variance reduction for factorized graphical models by “clamping” each discrete latent variable $z_i$ in turn and exactly evaluating the inner expectation over all settings of $z_i$, holding the remaining randomness fixed. Given $z$ sampled via a non-differentiable function $z = g(\epsilon, \theta)$ of base noise $\epsilon$, this leads to an estimator of the form

$$\nabla_\theta\, \mathbb{E}_{q_\theta(z)}[f(z)] \;=\; \sum_i \mathbb{E}\Big[\sum_{k} \nabla_\theta\, q_\theta(z_i = k \mid z_{\setminus i})\, f(z_i = k,\, z_{\setminus i})\Big].$$
Employing common random numbers (a shared $\epsilon$ across the enumerated settings of each $z_i$) ensures that the estimator’s variance is not greater than that of the likelihood-ratio method with a baseline (Tokui et al., 2016).
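A minimal sketch of this clamping estimator for a factorized Bernoulli posterior follows (PyTorch; the objective f and the dimensionality are illustrative, and the enumeration is written out explicitly rather than vectorized).

```python
import torch

torch.manual_seed(0)
D = 4                                      # number of binary latent variables
theta = torch.zeros(D)                     # logits: q(z_i = 1) = sigmoid(theta_i)

def f(z):
    # Hypothetical objective on the binary vector z.
    return (z.sum() - 1.0) ** 2

probs = torch.sigmoid(theta)
u = torch.rand(D)                          # common random numbers, reused below
z = (u < probs).float()                    # non-differentiable base sample

grad = torch.zeros(D)
for i in range(D):
    for k in (0.0, 1.0):                   # exactly marginalize coordinate i
        z_clamped = z.clone()
        z_clamped[i] = k
        # d/dtheta_i of q(z_i = k) for a Bernoulli(sigmoid(theta_i)):
        dq = probs[i] * (1 - probs[i]) * (1.0 if k == 1.0 else -1.0)
        grad[i] += dq * f(z_clamped)       # other coordinates kept fixed (shared u)
print("clamping estimate of grad_theta E_q[f(z)]:", grad)
```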
3. Continuous Relaxations: Gumbel-Softmax and Beyond
Continuous relaxation is foundational for pathwise gradient estimation with discrete latent variables.
- Gumbel-Softmax/Concrete distribution: Sampling from a categorical with class probabilities $\pi_1, \dots, \pi_K$ is re-expressed as $y_k = \dfrac{\exp\!\big((\log \pi_k + g_k)/\tau\big)}{\sum_{j=1}^{K} \exp\!\big((\log \pi_j + g_j)/\tau\big)}$, with $g_k \sim \mathrm{Gumbel}(0,1)$ i.i.d. and $\tau > 0$ a temperature hyperparameter. As $\tau \to 0$, $y$ becomes nearly one-hot (discrete), but for nonzero $\tau$ the process is differentiable.
- This relaxation enables backpropagation through discrete stochastic nodes, crucial for training VAEs with categorical latents, structured output models, and semi-supervised classifiers. The “straight-through” estimator applies a hard $\arg\max$ in the forward pass but uses the relaxed sample for the backward pass (Jang et al., 2016); a minimal sketch follows below.
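The sketch below shows the relaxation and its straight-through variant (PyTorch; the toy loss is illustrative, and PyTorch also ships an equivalent built-in, `torch.nn.functional.gumbel_softmax`).

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, hard=False):
    """Relaxed one-hot sample from a categorical given unnormalized logits.

    With hard=True the forward pass uses argmax (exact one-hot) while gradients
    flow through the relaxed sample (straight-through estimator).
    """
    gumbels = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbels) / tau, dim=-1)
    if not hard:
        return y_soft
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Straight-through: y_hard in the forward pass, gradient of y_soft backward.
    return y_hard - y_soft.detach() + y_soft

logits = torch.zeros(2, 5, requires_grad=True)      # batch of 2, K = 5 categories
y = gumbel_softmax_sample(logits, tau=0.5, hard=True)
loss = (y * torch.arange(5.0)).sum()                # toy downstream loss
loss.backward()
print(y, logits.grad)
```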
However, continuous relaxations generally yield biased estimators and their objectives are not aligned with the true evidence lower bound (ELBO) for the discrete variable. They may also require careful temperature annealing and, in complex settings, may fail to produce accurate uncertainty estimates (Kuśmierczyk et al., 2020). Advances such as mixtures of discrete normalizing flows (MDNF) construct expressive, invertible transformations in discrete space and provide exact probability mass evaluation for the categorical latent, supporting unbiased ELBO optimization even under complex priors.
4. Advanced Control Variates, Couplings, and Rao-Blackwellization
Variance reduction remains a central concern. Multiple approaches combine reparameterization-based estimators with control variates:
- REBAR: Constructs an unbiased estimator by subtracting a continuous-relaxation-based control variate from the high-variance score-function estimator and correcting with a reparameterization-trick term; the temperature of the relaxation is adapted online to minimize estimator variance (Tucker et al., 2017). A single-variable sketch appears at the end of this section.
- Categorical reparameterization via stick-breaking and coupling: Categorical variables are mapped as sequences of binary stick-breaking decisions. Antithetic sampling and importance-weighted estimators are then constructed, and Rao-Blackwellization is used to analytically integrate out irrelevant noise, greatly reducing estimator variance. This framework generalizes previously efficient binary estimators (e.g., DisARM/U2G) to multi-category variables (Dong et al., 2021).
Rao-Blackwellization is further formalized in the context of reparameterization gradients for continuous models (the R2-G2 estimator), but analogous reductions in discrete models are now being explored by conditioning on tractable subsets or auxiliary relaxations (Lam et al., 2025).
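To ground the REBAR construction above, here is a hedged single-Bernoulli sketch (PyTorch; the objective f, temperature lam, and control-variate weight eta are illustrative choices, and the conditional-sampling formulas are the single-variable case). The relaxed control variate is subtracted inside the score-function term and added back through two reparameterized terms, which keeps the estimator unbiased for any fixed eta and lam.

```python
import torch

torch.manual_seed(0)

def f(x):
    # Hypothetical objective; accepts the hard b or a relaxed sigmoid(z / lam).
    return (x - 0.45) ** 2

def rebar_grad(phi, eta=1.0, lam=0.5):
    """Single-sample REBAR-style gradient for b ~ Bernoulli(sigmoid(phi))."""
    theta = torch.sigmoid(phi)
    u = torch.rand(()).clamp(1e-6, 1 - 1e-6)
    v = torch.rand(()).clamp(1e-6, 1 - 1e-6)

    # Reparameterized pre-threshold logistic sample; b = H(z).
    z = phi + torch.log(u) - torch.log1p(-u)
    b = (z > 0).float()

    # z_tilde ~ p(z | b): conditional relaxation used as the control variate.
    z_tilde = torch.where(
        b > 0.5,
        torch.log(v / ((1 - v) * (1 - theta)) + 1),
        -torch.log(v / ((1 - v) * theta) + 1),
    )

    log_p_b = torch.distributions.Bernoulli(probs=theta).log_prob(b)
    relax = lambda t: torch.sigmoid(t / lam)

    obj = (
        (f(b) - eta * f(relax(z_tilde)).detach()) * log_p_b   # score-function term
        + eta * f(relax(z))                                   # reparameterized correction
        - eta * f(relax(z_tilde))                             # conditional correction
    )
    return torch.autograd.grad(obj, phi)[0]

phi = torch.tensor(0.0, requires_grad=True)
grads = torch.stack([rebar_grad(phi) for _ in range(2000)])
print("mean:", grads.mean().item(), "std:", grads.std().item())
# Exact gradient for comparison: (f(1) - f(0)) * sigmoid'(0) = 0.1 * 0.25
print("exact:", (f(torch.tensor(1.0)) - f(torch.tensor(0.0))).item() * 0.25)
```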
5. Application to Discrete Neural Representations and Compression
Discrete reparameterization methods are crucial for the training and deployment of neural networks with binary/ternary weights and/or activations:
- Local Reparameterization Trick (LRT): Instead of directly sampling discrete weights, the network propagates the distribution over pre-activations, which the central limit theorem renders approximately Gaussian: $a_j \sim \mathcal{N}\big(\sum_i x_i\, \mathbb{E}[w_{ij}],\ \sum_i x_i^2\, \mathrm{Var}[w_{ij}]\big)$. During training, gradients are taken with respect to the parameters of the underlying multinomial distributions encoding the discrete weights (Shayer et al., 2017). This underpins memory-efficient, multiplication-free inference on resource-limited hardware (see the sketch after this list).
- Recent advances extend LRT to simultaneously handle discrete activations, employing Gumbel-Softmax-based binary thresholding, so both weights and activations are trained within a differentiable stochastic framework. The result is state-of-the-art accuracy for binarized networks in vision tasks while retaining high compression rates and computational efficiency (Berger et al., 2023).
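A sketch of an LRT-style linear layer with ternary weights follows (PyTorch; the three-way softmax parameterization and the small variance floor are simplifications of the cited formulation).

```python
import torch

class LRTTernaryLinear(torch.nn.Module):
    """Sketch of the local reparameterization trick for ternary weights {-1, 0, +1}.

    Each weight carries a learnable categorical distribution (logits over the
    three values); instead of sampling weights, the layer samples pre-activations
    from the CLT Gaussian N(x @ E[W], x^2 @ Var[W]).
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(in_features, out_features, 3))
        self.register_buffer("values", torch.tensor([-1.0, 0.0, 1.0]))

    def forward(self, x):
        probs = torch.softmax(self.logits, dim=-1)            # (in, out, 3)
        w_mean = (probs * self.values).sum(-1)                 # E[W], shape (in, out)
        w_var = (probs * self.values ** 2).sum(-1) - w_mean ** 2
        m = x @ w_mean                                         # pre-activation mean
        v = (x ** 2) @ w_var                                   # pre-activation variance
        eps = torch.randn_like(m)
        return m + torch.sqrt(v + 1e-8) * eps                  # Gaussian reparameterization

layer = LRTTernaryLinear(16, 4)
out = layer(torch.randn(8, 16))
out.sum().backward()                                           # gradients reach the logits
print(out.shape, layer.logits.grad.shape)
```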
Entropy-penalized reparameterization schemes assign a learnable, discrete latent space to network parameters; the latent representation is then compressed via arithmetic coding by imposing an entropy penalty in the objective during training. This produces high-accuracy, compressed models using only a single end-to-end optimization stage (Oktay et al., 2019).
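The sketch below conveys the core idea of an entropy penalty as an estimated code length added to the training objective (PyTorch; the uniform-noise relaxation of quantization and the single-Gaussian entropy model are illustrative simplifications, not the cited method's exact reparameterization or probability model).

```python
import torch

class EntropyPenalizedWeight(torch.nn.Module):
    """Sketch: a weight tensor decoded from a quantized latent, trained with an entropy penalty.

    During training the latent is perturbed with uniform noise as a differentiable
    stand-in for rounding, and its code length is estimated under a learned
    discretized Gaussian; at deployment the rounded symbols would be entropy-coded.
    """

    def __init__(self, shape):
        super().__init__()
        self.latent = torch.nn.Parameter(torch.randn(shape))
        self.scale = torch.nn.Parameter(torch.tensor(0.05))    # decoder: w = scale * q
        self.mu = torch.nn.Parameter(torch.tensor(0.0))
        self.log_sigma = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self):
        noisy = self.latent + (torch.rand_like(self.latent) - 0.5)  # relaxed quantization
        model = torch.distributions.Normal(self.mu, self.log_sigma.exp())
        # Probability mass of the unit-width quantization bin around each symbol.
        p = model.cdf(noisy + 0.5) - model.cdf(noisy - 0.5)
        bits = -torch.log2(p.clamp_min(1e-9)).sum()                 # estimated code length
        return self.scale * noisy, bits

wmod = EntropyPenalizedWeight((4, 4))
w, bits = wmod()
loss = (w ** 2).sum() + 1e-3 * bits        # placeholder task loss + entropy penalty
loss.backward()
print(w.shape, round(bits.item(), 2))
```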
6. Structured and Efficient Marginalization Strategies
For models with exponentially large discrete latent spaces, efficient yet exact marginalization is enabled by leveraging sparse mapping operators:
- Sparsemax and SparseMAP: Replacing softmax with sparsemax induces distributions with many zero probabilities, drastically shrinking the support over which marginalization must be performed. For structured latent variables (e.g. concatenations of bits), top-$k$ sparsemax and SparseMAP limit the effective sum to a polynomial number of nonzero terms, allowing deterministic, low-variance gradients and making structured discrete VAEs practical even with large latent spaces (Correia et al., 2020); a minimal sketch appears at the end of this section.
This approach contrasts with stochastic sampling schemes or continuous relaxations, as it retains exactness in marginalization, provides variance reduction via determinism, and adapts the computational burden in response to model confidence.
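For the unstructured (categorical) case, a minimal sketch of exact marginalization through sparsemax is shown below (PyTorch; the logits and the per-category losses are illustrative, and top-$k$ sparsemax/SparseMAP extend the same idea to structured and very large latent spaces).

```python
import torch

def sparsemax(logits):
    """Sparsemax projection of a logit vector onto the probability simplex."""
    z, _ = torch.sort(logits, descending=True)
    k = torch.arange(1, logits.numel() + 1, dtype=logits.dtype)
    cssv = torch.cumsum(z, dim=0) - 1.0
    support = k * z > cssv                      # coordinates kept in the support
    k_sup = support.sum()
    tau = cssv[k_sup - 1] / k_sup               # threshold so the output sums to 1
    return torch.clamp(logits - tau, min=0.0)

# Exact, deterministic marginalization over a categorical latent with K = 6 values:
# sparsemax puts zero mass on most categories, so the expected loss is a sum over
# only the few categories in the support (hypothetical per-category losses).
logits = torch.tensor([2.0, 1.9, 0.1, -1.0, -2.0, -3.0], requires_grad=True)
p = sparsemax(logits)
losses = torch.tensor([0.3, 0.7, 1.0, 1.2, 0.9, 0.5])     # f(x, z=k) for each k
support = p > 0
expected_loss = (p[support] * losses[support]).sum()        # exact E_p[f], no sampling
expected_loss.backward()
print("support size:", int(support.sum()), "grad:", logits.grad)
```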
7. Unbiasedness, Variance Guarantees, and Implications
A recurring concern in the literature is the bias–variance tradeoff. Certain techniques, such as direct marginalization with common random numbers (Tokui et al., 2016), unbiased stick-breaking/coupled estimators (Dong et al., 2021), or quantile-transform estimators for mixture weights (Graves, 2016), yield unbiased gradients with provably reduced variance relative to the score-function (REINFORCE/likelihood-ratio) estimator, even when the latter is equipped with an optimal baseline. Theoretical results guarantee that estimator variance is not greater than that of alternative estimators at discrete nodes, and empirical studies consistently demonstrate accelerated convergence, improved likelihood bounds, and superior final model performance.
A plausible implication is that combining Rao-Blackwellization, advanced control variates, or sparse marginalization with modern relaxations and surrogates further extends the regime in which discrete latent variable models can be trained at scale, with competitive performance to their continuous counterparts and interpretability or efficiency benefits unique to discrete representations.
In summary, discrete latent variable reparameterization is a broad and continually evolving field that underpins robust, scalable, and interpretable learning in probabilistic models containing discrete structure. Techniques are drawn from quantile-based transforms, marginalization via common random numbers, continuous relaxations, control variates, statistical couplings, and efficient sparse mappings, and are foundational for variational inference, structured generative modeling, neural compression, and deep architectures with discrete representations. The theoretical foundations and empirical results established in this literature provide a toolkit for addressing the inherent challenges imposed by discrete stochasticity in differentiable learning systems.