Local Reparameterization Trick

Updated 2 April 2026

The local reparameterization trick is a variance reduction technique that transfers stochasticity from global model parameters to local layer activations.
It achieves lower gradient variance by sampling independent noise per activation, leveraging Gaussian assumptions and the central limit theorem.
The method improves computational efficiency and scalability in Bayesian neural networks and low-precision settings, and its variance reduction is justified theoretically through Rao-Blackwellisation.

The local reparameterization trick (LRT) is a variance reduction technique for stochastic gradient-based variational inference in neural networks with latent Gaussian (or certain discrete) parameterizations. LRT achieves substantial variance reduction in gradient estimators by shifting stochasticity from model parameters ("global noise") to layer activations ("local noise"), enabling efficient, highly parallelizable, and lower-variance training for Bayesian neural networks and quantized network variants. The trick builds on the central limit theorem and Rao-Blackwellisation, providing both theoretical clarity and practical speedups in scalable Bayesian deep learning (Kingma et al., 2015, Shayer et al., 2017, Lam et al., 9 Jun 2025, Berger et al., 2023).

1. Variational Inference and Gradient Variance in Neural Networks

Stochastic gradient variational Bayes (SGVB) introduces a variational posterior $q_\phi(w)$ over neural network weights $w$ and maximizes the evidence lower bound (ELBO)

$\mathcal{L}(\phi) = \mathbb{E}_{w \sim q_\phi}[ \sum_{i=1}^N \log p(y^i|x^i, w)] - \mathrm{KL}(q_\phi(w) \parallel p(w)).$

Typically, global reparameterization samples $w$ once per minibatch using $w = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. The resulting minibatch estimate of the gradient of the expected log-likelihood term with respect to the variational parameters $\phi$ is

$\nabla_\phi \widehat{\mathcal{L}}_{\mathrm{SGVB}} = \frac{N}{M} \sum_{m=1}^M \nabla_\phi \log p(y^{i_m}|x^{i_m}, w),$

where $i_1, \dots, i_M$ index the examples in the minibatch. Because all $M$ examples share the same sampled $w$, their log-likelihood terms are correlated, and this covariance induces a variance floor that does not diminish as the minibatch size $M$ increases (Kingma et al., 2015).
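As a rough illustration of why shared weight noise correlates examples, the NumPy sketch below draws one weight sample per forward pass for a single linear layer and compares the empirical covariance between the pre-activations of two examples with the analytic value $\sum_i a_{1,i} a_{2,i} \sigma_i^2$. The layer sizes, the seed, and the helper name `global_reparam_preactivations` are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def global_reparam_preactivations(A, mu, sigma, rng):
    """One weight draw shared by every row of the minibatch A."""
    eps = rng.standard_normal(mu.shape)      # global noise, one value per weight
    W = mu + sigma * eps                     # w = mu + sigma * eps
    return A @ W                             # shape (M, D_out)

rng = np.random.default_rng(0)
M, D_in, D_out = 2, 8, 1                     # toy sizes (assumed)
A = rng.uniform(0.5, 1.5, size=(M, D_in))    # positive inputs give a positive covariance
mu = rng.standard_normal((D_in, D_out))
sigma = np.full((D_in, D_out), 0.3)

# Repeat the forward pass to estimate the covariance between the two examples.
draws = np.stack([global_reparam_preactivations(A, mu, sigma, rng)[:, 0]
                  for _ in range(20000)])
empirical_cov = np.cov(draws.T)[0, 1]
analytic_cov = float(np.sum(A[0] * A[1] * sigma[:, 0] ** 2))
print("empirical:", empirical_cov, "analytic:", analytic_cov)
```

The two numbers should agree up to Monte Carlo error. The nonzero value is the cross-example covariance that keeps the estimator variance from shrinking with the minibatch size.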

2. Local Reparameterization Trick: Noise Transfer from Parameters to Activations

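LRT exploits the property that, if the posterior over the weights of a linear layer is a fully factorized Gaussian, $q_\phi(w_{ij}) = \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$, the pre-activations $b_{m,j} = \sum_{i} A_{m,i} w_{ij}$ (for batch input $A$ and weights $W$) are marginally Gaussian:

$q_\phi(b_{m,j} \mid A) = \mathcal{N}(\gamma_{m,j}, \delta_{m,j}), \quad \gamma_{m,j} = \sum_i A_{m,i}\, \mu_{ij}, \quad \delta_{m,j} = \sum_i A_{m,i}^2\, \sigma_{ij}^2.$

Instead of sampling all weights $W$ for each example in the minibatch, LRT samples one independent $\zeta_{m,j} \sim \mathcal{N}(0,1)$ per activation and reconstructs $b_{m,j} = \gamma_{m,j} + \sqrt{\delta_{m,j}}\, \zeta_{m,j}$. This procedure ensures that noise is local to each data point and neuron, rendering covariance between data points zero. Consequently, writing $L_i$ for the contribution of data point $i$ to the estimator,

$\mathrm{Var}\big[\widehat{\mathcal{L}}_{\mathrm{SGVB}}\big] = \frac{N^2}{M}\, \mathrm{Var}[L_i],$

which scales the estimator variance inversely with batch size and eliminates the variance floor seen in global parameter sampling (Kingma et al., 2015, Lam et al., 9 Jun 2025).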
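The sampling step described above can be sketched in NumPy as follows. The sketch assumes a fully factorized Gaussian posterior with mean matrix `mu` and standard deviation matrix `sigma`; the function name `lrt_forward`, the layer sizes, and the seed are hypothetical choices made for illustration.

```python
import numpy as np

def lrt_forward(A, mu, sigma, rng):
    """Sample pre-activations directly from their Gaussian marginals."""
    gamma = A @ mu                           # means, shape (M, D_out)
    delta = (A ** 2) @ (sigma ** 2)          # variances, shape (M, D_out)
    zeta = rng.standard_normal(gamma.shape)  # one noise draw per activation
    return gamma + np.sqrt(delta) * zeta

rng = np.random.default_rng(0)
M, D_in, D_out = 2, 8, 1                     # toy sizes (assumed)
A = rng.uniform(0.5, 1.5, size=(M, D_in))
mu = rng.standard_normal((D_in, D_out))
sigma = np.full((D_in, D_out), 0.3)

# Empirical check: per-example variance matches delta, cross-example covariance is ~0.
draws = np.stack([lrt_forward(A, mu, sigma, rng)[:, 0] for _ in range(20000)])
cov = np.cov(draws.T)
expected_var = (A ** 2) @ (sigma ** 2)
print("cross-example covariance:", cov[0, 1])
print("variance of example 0:", cov[0, 0], "expected:", expected_var[0, 0])
```

Because each activation receives its own noise draw, the off-diagonal entry of the covariance matrix should be close to zero, while the diagonal matches the analytic variance.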

3. Computational Efficiency and Parallelization

LRT reduces both computation and memory. For a batch size $M$, input dimension $D_{\mathrm{in}}$, and output dimension $D_{\mathrm{out}}$, drawing an independent weight matrix for every example requires $M \times D_{\mathrm{in}} \times D_{\mathrm{out}}$ random samples and $M$ separate vector-matrix products. LRT only requires $M \times D_{\mathrm{out}}$ standard normal samples and two dense matrix multiplications (one for the means and one for the variances), as noise is injected post-aggregation at the activation level. This local noise structure makes the approach fully parallelizable, fitting well with BLAS routines and GPU computation (Kingma et al., 2015). Empirical results showed a roughly $200\times$ speedup on GPU versus per-datapoint weight sampling, reducing epoch time from about 1,600 s to about 7.4 s in large-scale experiments.
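To make the difference in sampling cost concrete, the small sketch below counts noise draws for one layer under per-example weight sampling and under local reparameterization. The helper name and the layer sizes (a 1000 x 1000 layer with a minibatch of 1000) are illustrative assumptions.

```python
def noise_draws_per_layer(M, D_in, D_out):
    """Return (per-example weight sampling, local reparameterization) draw counts."""
    per_example_weights = M * D_in * D_out   # a full weight matrix per example
    local = M * D_out                        # one scalar per activation
    return per_example_weights, local

per_example, local = noise_draws_per_layer(M=1000, D_in=1000, D_out=1000)
print(per_example, local, per_example // local)   # prints 1000000000 1000000 1000
```

The ratio equals the input dimension, so the saving in random number generation grows with layer width.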

4. Rao-Blackwellisation and Theoretical Foundations

The variance reduction property of LRT can be formalized through the Rao-Blackwellised reparameterization gradient estimator (R2-G2) (Lam et al., 9 Jun 2025). In linear Gaussian models, conditioning on the pre-activations (sufficient statistics for each data point) provides a closed-form conditional expectation. The R2-G2 estimator computes

$\hat{g}_{\mathrm{R2\text{-}G2}} = \mathbb{E}\big[\hat{g}_{\mathrm{RT}} \mid b\big],$

where $b$ is the pre-activation, and $\hat{g}_{\mathrm{RT}}$ is the standard reparameterization gradient. This conditioning collapses multi-dimensional global noise into independent local scalar noise per activation, provably reducing variance with minimal increase in computation. For models with linear layers and independent Gaussian weights, LRT coincides exactly with the R2-G2 estimator (Lam et al., 9 Jun 2025).
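The following NumPy sketch illustrates the Rao-Blackwellisation argument for a single neuron with pre-activation $b = \sum_i a_i(\mu_i + \sigma_i \epsilon_i)$. It uses the standard Gaussian conditioning identity $\mathbb{E}[\epsilon_i \mid b] = a_i \sigma_i (b - \gamma)/\delta$, where $\gamma$ and $\delta$ are the mean and variance of $b$. The toy loss $\log\cosh(b)$, the sizes, and all variable names are assumptions chosen for illustration, not the setup of Lam et al.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 5, 200_000                          # toy sizes (assumed)
a = rng.uniform(0.5, 1.5, size=D)          # one input vector
mu = rng.standard_normal(D)
sigma = np.full(D, 0.4)

gamma = a @ mu                             # mean of the pre-activation b
delta = np.sum(a ** 2 * sigma ** 2)        # variance of b

eps = rng.standard_normal((n, D))          # global (per-weight) noise
b = (mu + sigma * eps) @ a                 # pre-activation for each noise draw
dloss_db = np.tanh(b)                      # derivative of the toy loss log cosh(b)

# Standard reparameterization gradient w.r.t. sigma_i: dL/db * a_i * eps_i
g_global = dloss_db[:, None] * a * eps

# Rao-Blackwellised version: replace eps_i by E[eps_i | b]
eps_given_b = (a * sigma) * ((b - gamma) / delta)[:, None]
g_rb = dloss_db[:, None] * a * eps_given_b

# LRT gradient from b = gamma + sqrt(delta) * zeta, with zeta = (b - gamma) / sqrt(delta)
zeta = (b - gamma) / np.sqrt(delta)
g_lrt = dloss_db[:, None] * (a ** 2 * sigma) * (zeta / np.sqrt(delta))[:, None]

print("mean (global):", g_global.mean(axis=0))
print("mean (RB):    ", g_rb.mean(axis=0))
print("total variance, global vs RB:", g_global.var(axis=0).sum(), g_rb.var(axis=0).sum())
print("RB equals LRT:", np.allclose(g_rb, g_lrt))
```

In this toy setting the two estimators have matching means up to Monte Carlo error, the conditioned estimator has lower total variance, and it agrees sample by sample with the gradient obtained by differentiating the LRT sampling formula.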

5. Applications to Discrete Weights and Activations

The LRT has been extended to settings with discrete weights and activations, facilitating low-precision neural network training suited for hardware efficiency (Shayer et al., 2017, Berger et al., 2023). For binary or ternary weights, each weight $w_{ij}$ is treated as a discrete random variable. By the central limit theorem, the pre-activation $b_{m,j} = \sum_i A_{m,i} w_{ij}$ is approximately Gaussian for large input dimension, and the layer output is sampled as $b_{m,j} = \gamma_{m,j} + \sqrt{\delta_{m,j}}\, \zeta_{m,j}$ with $\zeta_{m,j} \sim \mathcal{N}(0,1)$, where $\gamma_{m,j} = \sum_i A_{m,i}\, \mathbb{E}[w_{ij}]$ and $\delta_{m,j} = \sum_i A_{m,i}^2\, \mathrm{Var}[w_{ij}]$ are the mean and variance of $b_{m,j}$ under the discrete distribution. This approach provides a differentiable, low-variance surrogate objective and significantly reduces gradient estimator variance compared to naïve REINFORCE or Gumbel-softmax relaxations.
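A minimal sketch of this idea for ternary weights in $\{-1, 0, +1\}$ is given below. It computes each weight's mean and variance from its categorical probabilities and then samples pre-activations from the Gaussian approximation. The function names, the Dirichlet-initialized probabilities, and the layer sizes are hypothetical and only serve the illustration.

```python
import numpy as np

def ternary_moments(p_minus, p_zero, p_plus):
    """Mean and variance of weights taking values {-1, 0, +1}."""
    mean = p_plus - p_minus
    var = (p_plus + p_minus) - mean ** 2     # E[w^2] - E[w]^2; p_zero contributes 0
    return mean, var

def discrete_lrt_forward(A, mean_w, var_w, rng):
    """Gaussian (central limit theorem) approximation of the pre-activations."""
    gamma = A @ mean_w                       # pre-activation means
    delta = (A ** 2) @ var_w                 # pre-activation variances
    return gamma + np.sqrt(delta) * rng.standard_normal(gamma.shape)

rng = np.random.default_rng(0)
D_in, D_out = 256, 4                                   # toy sizes (assumed)
probs = rng.dirichlet(np.ones(3), size=(D_in, D_out))  # shape (D_in, D_out, 3)
mean_w, var_w = ternary_moments(probs[..., 0], probs[..., 1], probs[..., 2])
A = rng.standard_normal((32, D_in))
b = discrete_lrt_forward(A, mean_w, var_w, rng)
print(b.shape)   # prints (32, 4)
```

The sample `b` is a smooth function of the probabilities through `mean_w` and `var_w`, which is what allows gradients to reach the parameters of the discrete distribution.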

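LRT has also been combined with the Gumbel-softmax trick for binarized activations. For an activation $h_{m,j} = \mathrm{sign}(b_{m,j})$ with Gaussian pre-activation $b_{m,j} \sim \mathcal{N}(\gamma_{m,j}, \delta_{m,j})$, the probability $P(h_{m,j} = +1) = \Phi\big(\gamma_{m,j} / \sqrt{\delta_{m,j}}\big)$, with $\Phi$ the standard normal CDF, allows for a continuous relaxation in backpropagation, further extending the variance reduction and computational benefits to the binarized activations regime (Berger et al., 2023).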
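As a hedged sketch of how such a relaxation might look, the code below assumes a sign activation applied to a Gaussian pre-activation with mean `gamma` and variance `delta`, computes the probability of a positive output from the normal CDF, and draws a relaxed sample with the binary form of the Gumbel-softmax (concrete) distribution. The function names, the temperature, and the clipping constants are assumptions, and this is not necessarily the exact construction used by Berger et al.

```python
import math
import numpy as np

def prob_positive(gamma, delta):
    """P(b > 0) for b ~ N(gamma, delta), elementwise, via the normal CDF."""
    z = gamma / np.sqrt(2.0 * delta)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def relaxed_binary_sample(p, temperature, rng):
    """Binary Gumbel-softmax relaxation of a draw that is +1 with probability p."""
    p = np.clip(p, 1e-6, 1.0 - 1e-6)                 # avoid log(0)
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(p))
    logistic_noise = np.log(u) - np.log1p(-u)        # difference of two Gumbel draws
    logits = np.log(p) - np.log1p(-p)
    soft = 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))
    return 2.0 * soft - 1.0                          # map (0, 1) to (-1, +1)

rng = np.random.default_rng(0)
gamma = np.array([-1.0, 0.0, 2.0])
delta = np.array([1.0, 1.0, 4.0])
p = prob_positive(gamma, delta)
h = relaxed_binary_sample(p, temperature=0.5, rng=rng)
print(p)
print(h)
```

Lower temperatures push the relaxed samples toward $\pm 1$ at the cost of higher gradient variance, so the temperature is a tuning choice rather than a fixed constant.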

6. Empirical Effects and Variational Dropout

Empirical evaluation has demonstrated the effectiveness of LRT for both model performance and efficiency. Local reparameterization attains an order of magnitude lower gradient variance than global sampling, leading to much faster convergence. On benchmarks such as MNIST and CIFAR-10, variational dropout trained with LRT matches or outperforms traditional dropout or fixed-variance Gaussian dropout, especially in resource-limited model regimes (Kingma et al., 2015). For low-precision discrete networks, LRT-based methods approach full-precision performance with significant computational and storage reductions (Shayer et al., 2017, Berger et al., 2023).

Variational dropout extends these principles by learning dropout rates as variational parameters, yielding a posterior $q_\phi(w_{ij}) = \mathcal{N}(\theta_{ij}, \alpha\, \theta_{ij}^2)$ with the KL term of the ELBO simplified under a log-uniform prior. This enables model- and layer-specific noise injection rates, optimized via SGD for maximal generalization (Kingma et al., 2015).

7. Connections to Dropout and Practical Implementation

Gaussian dropout can be interpreted as a special case of LRT with a fixed variance, where the corresponding variational posterior is $q_\phi(w_{ij}) = \mathcal{N}(\theta_{ij}, \alpha\, \theta_{ij}^2)$ with fixed $\alpha$, and the KL-divergence regularization with respect to the prior can be made scale-invariant via a log-uniform prior. This renders the dropout objective equivalent to maximizing the evidence lower bound for the corresponding variational formulation (Kingma et al., 2015).
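The correspondence can be checked numerically. The sketch below assumes Gaussian dropout is implemented as multiplicative $\mathcal{N}(1, \alpha)$ noise on the layer inputs with $\alpha = p/(1-p)$, and compares the empirical mean and variance of the resulting pre-activations with the LRT moments under a weight posterior $\mathcal{N}(\theta_{ij}, \alpha\theta_{ij}^2)$. The function names, sizes, and dropout rate are illustrative assumptions; only the per-activation marginal moments are compared.

```python
import numpy as np

def gaussian_dropout_forward(A, theta, alpha, rng):
    """Multiplicative N(1, alpha) noise on the inputs, then a deterministic layer."""
    xi = 1.0 + np.sqrt(alpha) * rng.standard_normal(A.shape)
    return (A * xi) @ theta

def lrt_moments(A, theta, alpha):
    """Pre-activation mean and variance under q(w_ij) = N(theta_ij, alpha * theta_ij^2)."""
    return A @ theta, alpha * ((A ** 2) @ (theta ** 2))

rng = np.random.default_rng(0)
a = rng.standard_normal((1, 6))            # a single input row (toy size)
theta = rng.standard_normal((6, 3))
alpha = 0.25 / (1 - 0.25)                  # alpha = p / (1 - p) with p = 0.25

# Repeat the same input row so each row receives independent dropout noise.
A_rep = np.repeat(a, 50000, axis=0)
samples = gaussian_dropout_forward(A_rep, theta, alpha, rng)
gamma, delta = lrt_moments(a, theta, alpha)
print(samples.mean(axis=0), gamma[0])
print(samples.var(axis=0), delta[0])
```

The empirical moments should match `gamma` and `delta` up to sampling error, which is the sense in which fixed-variance Gaussian dropout and LRT describe the same activation noise.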

Implementation of LRT typically involves precomputing mean and variance at the activation layer, sampling standard Gaussians per activation, forming noisy activations, and propagating these through the rest of the network. Backpropagation proceeds via the chain rule through both mean and variance with gradients scaling according to injected noise. Practical recommendations include using LRT whenever possible in Bayesian neural networks and VAE decoders, particularly in multilayer structures; on single-layer encoders, excessive variance reduction may be suboptimal for representational dynamics (Lam et al., 9 Jun 2025). In binarized settings, entropy-regularizing penalties on logits preserve stochasticity essential for effective Gaussian approximation (Berger et al., 2023).
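The implementation steps above can be sketched as a forward and a manual backward pass for one linear layer. The sketch assumes a factorized Gaussian posterior with parameters `mu` and `sigma`, holds the noise `zeta` fixed so the gradient can be checked by finite differences, and uses a toy quadratic loss. All names and sizes are hypothetical.

```python
import numpy as np

def lrt_forward(A, mu, sigma, zeta):
    """Compute activation means and variances, then add scaled noise."""
    gamma = A @ mu
    delta = (A ** 2) @ (sigma ** 2)
    b = gamma + np.sqrt(delta) * zeta
    return b, delta

def lrt_backward(A, sigma, zeta, delta, grad_b):
    """Chain rule through the mean path (mu) and the variance path (sigma)."""
    grad_mu = A.T @ grad_b
    grad_sigma = sigma * ((A ** 2).T @ (grad_b * zeta / np.sqrt(delta)))
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
mu = rng.standard_normal((5, 3))
sigma = rng.uniform(0.1, 0.5, size=(5, 3))
zeta = rng.standard_normal((4, 3))          # noise held fixed for the check

def loss(mu_, sigma_):
    b_, _ = lrt_forward(A, mu_, sigma_, zeta)
    return 0.5 * np.sum(b_ ** 2)            # toy loss, so dL/db = b

b, delta = lrt_forward(A, mu, sigma, zeta)
grad_mu, grad_sigma = lrt_backward(A, sigma, zeta, delta, grad_b=b)

# Central finite-difference check on one entry of sigma.
step = 1e-6
sigma_plus = sigma.copy()
sigma_plus[0, 0] += step
sigma_minus = sigma.copy()
sigma_minus[0, 0] -= step
fd_sigma = (loss(mu, sigma_plus) - loss(mu, sigma_minus)) / (2 * step)
print(grad_sigma[0, 0], fd_sigma)
```

In an automatic differentiation framework the backward pass would not be written by hand; it is spelled out here only to show how the injected noise scales the gradient with respect to `sigma`.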

In summary, the local reparameterization trick is a foundational technique for scalable, low-variance variational inference in modern neural architectures, offering theoretical guarantees via Rao-Blackwellisation and broadening the landscape for efficient Bayesian deep learning and quantized network deployment (Kingma et al., 2015, Shayer et al., 2017, Lam et al., 9 Jun 2025, Berger et al., 2023).