Contrastive Divergence: Algorithm & Analysis
- Contrastive Divergence is a stochastic approximation algorithm that uses short MCMC chains to estimate intractable gradients in unnormalized exponential-family models.
- It replaces the intractable expectation in the log-likelihood gradient with samples from m-step MCMC kernels, balancing computational efficiency with controlled bias and variance.
- Under regularity conditions with appropriately scaled MCMC steps and learning rates, CD achieves the parametric O(n⁻¹/²) convergence rate and asymptotic variance near the Cramér–Rao bound.
Contrastive Divergence (CD) is a widely used stochastic approximation algorithm for training unnormalized models—most notably Restricted Boltzmann Machines, general exponential-family graphical models, and contemporary neural energy-based models—by replacing the intractable expectation in the log-likelihood gradient with short Markov chains initialized at the observed data. The method is valued for its computational efficiency and strong empirical performance, and a nuanced theoretical picture of its bias, consistency, and statistical optimality has been developed to understand its properties.
1. Problem Formulation and Algorithmic Principle
Consider a minimal exponential family model with unnormalized density
$$\tilde p_\theta(x) = h(x)\, e^{\langle \theta, T(x)\rangle}, \qquad p_\theta(x) = \frac{\tilde p_\theta(x)}{Z(\theta)}, \qquad Z(\theta) = \int h(x)\, e^{\langle \theta, T(x)\rangle}\, dx,$$
where $T(x) \in \mathbb{R}^d$ is the vector of sufficient statistics, $\theta \in \Theta \subseteq \mathbb{R}^d$ is the natural parameter, $h$ is a known base measure, and $Z(\theta)$ is the partition function. Given i.i.d. samples $x_1, \dots, x_n$, the negative log-likelihood is (up to an additive constant independent of $\theta$)
$$L_n(\theta) = A(\theta) - \frac{1}{n}\sum_{i=1}^{n} \langle \theta, T(x_i)\rangle, \qquad A(\theta) = \log Z(\theta),$$
whose gradient is
$$\nabla L_n(\theta) = \mathbb{E}_{X \sim p_\theta}[T(X)] - \frac{1}{n}\sum_{i=1}^{n} T(x_i).$$
The model expectation $\mathbb{E}_{p_\theta}[T(X)]$ is typically intractable for high-dimensional models.
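For instance (an illustrative example of ours, not drawn from the cited paper), for a binary pairwise model on $x \in \{0,1\}^d$ with $T(x) = \big(x, (x_j x_k)_{j<k}\big)$, computing the expectation requires the partition function
$$Z(\theta) = \sum_{x \in \{0,1\}^d} e^{\langle \theta, T(x)\rangle}, \qquad \mathbb{E}_{p_\theta}[T(X)] = \frac{1}{Z(\theta)} \sum_{x \in \{0,1\}^d} T(x)\, e^{\langle \theta, T(x)\rangle},$$
a sum over $2^d$ configurations that is infeasible to evaluate exactly for moderate $d$.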
Contrastive Divergence (CD) replaces the intractable model expectation by running $m$-step MCMC kernels initialized at the data $x_i$. At iteration $t$, in online (single-sample) or offline (mini-batch) form, the algorithm proceeds as follows (a minimal code sketch follows the list):
- Draw a negative sample $\tilde X_t \sim K_{\theta_t}^{m}(x_{i_t}, \cdot)$, i.e., run $m$ steps of an MCMC kernel targeting $p_{\theta_t}$, started at a data point $x_{i_t}$.
- Compute the stochastic gradient $g_t = T(\tilde X_t) - T(x_{i_t})$ (averaged over the mini-batch in the offline case).
- Update: $\theta_{t+1} = \theta_t - \eta_t\, g_t$ with step size $\eta_t$ (projected back onto $\Theta$ if necessary).
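The following is a minimal, self-contained sketch of online CD-$m$ for a toy fully visible binary pairwise model; the model, the function names (`gibbs_sweep`, `cd_update`), the use of Gibbs sampling as the MCMC kernel, and all constants are illustrative assumptions of ours, not a specific setup from the cited work.

```python
# Sketch of online CD-m for a fully visible binary pairwise model
# p_theta(x) ∝ exp(x^T W x / 2 + b^T x), x in {0,1}^d, W symmetric with zero diagonal.
# Sufficient statistics: T(x) = (x x^T, x); the MCMC kernel is a systematic Gibbs sweep.
import numpy as np

def gibbs_sweep(x, W, b, rng):
    """One full Gibbs sweep: resample each coordinate from its conditional."""
    x = x.copy()
    for j in range(x.size):
        logit = W[j] @ x + b[j]                    # log-odds of x_j = 1 given the rest (diag(W) = 0)
        x[j] = float(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
    return x

def cd_update(W, b, x_data, m, lr, rng):
    """One online CD-m step from a single data vector x_data."""
    x_neg = x_data.copy()
    for _ in range(m):                             # m-step chain initialized at the data point
        x_neg = gibbs_sweep(x_neg, W, b, rng)
    # Stochastic gradient of the negative log-likelihood: T(negative sample) - T(data).
    gW = np.outer(x_neg, x_neg) - np.outer(x_data, x_data)
    np.fill_diagonal(gW, 0.0)                      # keep the diagonal of W fixed at zero
    gb = x_neg - x_data
    return W - lr * gW, b - lr * gb

# Toy usage: one pass over synthetic binary data with a 1/t step-size decay.
rng = np.random.default_rng(0)
d, n, m = 5, 2000, 5
data = (rng.random((n, d)) < 0.3).astype(float)
W, b = np.zeros((d, d)), np.zeros(d)
for t, x in enumerate(data, start=1):
    W, b = cd_update(W, b, x, m=m, lr=1.0 / t, rng=rng)
```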
2. Regularity Assumptions and Statistical Setup
The non-asymptotic analysis of CD makes the following technical assumptions:
- A1 (Regular exponential family): $\Theta \subset \mathbb{R}^d$ is convex and compact, and the log-partition function $A$ is strongly convex and smooth on $\Theta$: there are constants $0 < \mu \le L < \infty$ such that $\mu I \preceq \nabla^2 A(\theta) \preceq L I$ for all $\theta \in \Theta$.
- A2 (Divergence control): there exists a finite constant such that, for every $\theta \in \Theta$, the divergence between the data-generating distribution and the model $p_\theta$ (the quantity used to control the MCMC bias in Section 6) is uniformly bounded.
- A3 (Restricted spectral gap): the $m$-step MCMC kernel contracts the $L^2(p_\theta)$ deviations of the statistics entering the gradient; for such a statistic $f$,
$$\big\| K_\theta^{m} f - \mathbb{E}_{p_\theta}[f] \big\|_{L^2(p_\theta)} \le (1-\gamma)^{m} \big\| f - \mathbb{E}_{p_\theta}[f] \big\|_{L^2(p_\theta)},$$
where $\gamma \in (0,1]$ and $K_\theta$ is the MCMC transition operator.
These assumptions, together with smoothness of $A$ on $\Theta$, ensure control over the moments and the MCMC bias/variance of the relevant statistics.
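As a simple illustration of assumption A1 (an example of ours, not taken from the cited paper): for the Bernoulli family with $T(x) = x$ on $\{0,1\}$ and $A(\theta) = \log(1 + e^{\theta})$, one has
$$A''(\theta) = \sigma(\theta)\big(1 - \sigma(\theta)\big), \qquad \sigma(\theta) = \frac{1}{1 + e^{-\theta}},$$
so on a compact interval $\Theta = [-R, R]$ the second derivative lies between $\sigma(R)(1-\sigma(R)) > 0$ and $1/4$; $A$ is therefore strongly convex and smooth on $\Theta$, as A1 requires.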
3. Non-Asymptotic Convergence Rates
The main result establishes that, under A1–A3, CD achieves the parametric $O(n^{-1/2})$ rate for parameter estimation, matching maximum-likelihood estimation under regularity:
- The error bound is governed by an effective bias/variance constant determined by the strong-convexity and smoothness parameters of $A$, the divergence bound in A2, and the restricted spectral gap $\gamma$ in A3.
- For a suitably decaying step-size sequence and a sufficient number of MCMC steps $m$ per update, the mean-squared error of the iterate after processing $n$ samples satisfies $\mathbb{E}\,\|\theta_n - \theta^*\|^2 = O(1/n)$.
Therefore, $\|\theta_n - \theta^*\| = O_P(n^{-1/2})$ as $n \to \infty$ (Glaser et al., 15 Oct 2025).
This result closes the gap with previous analyses, which established only slower, non-parametric rates for batch CD under more restrictive assumptions (Jiang et al., 2016).
4. Batching Regimes: Online, Minibatch, and SGD Variants
The analysis applies to various data presentation and batching schemes:
- Fully online (batch size 1): The parametric rate holds directly, with the bias and variance decomposed through the stochastic updates.
- Minibatch/Offline CD: Minibatches of a fixed size are processed in several batches per epoch. Under mild additional moment conditions on the sufficient statistics and suitably decaying step sizes, an analogous mean-squared-error bound holds for the iterates after a given number of epochs, with additional terms that vanish as the number of epochs and batches grows. In the limit, the bound is of the parametric order for subexponential tails but is polynomially slower if only weaker (polynomial) moments are available (Glaser et al., 15 Oct 2025).
- With- and Without-Replacement Sampling: The asymptotic rate is unaffected: the parametric rate is attainable under both with-replacement (SGDw) and without-replacement (SGDo) sampling; the sketch below illustrates the two index-sampling schemes.
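A minimal sketch of the two mini-batch index-sampling schemes (variable names and batch size are illustrative, not taken from the cited paper); in either case, the CD gradient for a batch averages $T(\tilde X) - T(x)$ over the selected data points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 1000, 32

# With-replacement sampling (SGDw): each batch draws indices i.i.d. uniformly from the dataset,
# so a data point may appear in several batches of the same epoch.
batch_w = rng.integers(0, n, size=batch_size)

# Without-replacement sampling (SGDo): shuffle once per epoch, then walk through disjoint batches,
# so each data point is used exactly once per epoch.
perm = rng.permutation(n)
batches_o = [perm[k:k + batch_size] for k in range(0, n, batch_size)]
```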
5. Asymptotic Variance and Near-Optimality
Averaging the iterates with the Polyak–Ruppert scheme, $\bar\theta_n = \frac{1}{n}\sum_{t=1}^{n} \theta_t$ (with a suitably decaying step size and sufficiently many MCMC steps per update), achieves near-optimal asymptotic variance:
- If the number of MCMC steps grows at least logarithmically with the sample size, so that the residual MCMC bias is negligible, the asymptotic variance of the averaged iterates is bounded by $4\, I(\theta^*)^{-1}$,
where $I(\theta^*) = \nabla^2 A(\theta^*) = \mathrm{Cov}_{p_{\theta^*}}\!\big(T(X)\big)$ is the Fisher information (Glaser et al., 15 Oct 2025).
- The asymptotic variance is therefore within a factor 4 of the Cramér–Rao lower bound for fully-observed exponential families.
This establishes that, provided the MCMC chains are sufficiently long (scaling logarithmically with the sample size), CD is statistically near-optimal: its asymptotic variance is close to the Cramér–Rao benchmark for unbiased estimators.
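A minimal sketch of Polyak–Ruppert averaging wrapped around a generic stochastic update (the helper name `averaged_iterates` and its callable interface are illustrative assumptions of ours; any CD-style update, such as the `cd_update` sketched in Section 1, could be plugged in):

```python
import numpy as np

def averaged_iterates(theta0, update, n_steps):
    """Run theta <- update(theta, t) for n_steps and return the last and the averaged iterate."""
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()
    for t in range(1, n_steps + 1):
        theta = update(theta, t)
        theta_bar += (theta - theta_bar) / t   # running mean: theta_bar_t = (1/t) * sum_{s<=t} theta_s
    return theta, theta_bar
```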
6. Proof Structure and Technical Innovations
The theoretical analysis is built on several key ideas:
- Establishing a recursion for the mean-squared error $\mathbb{E}\|\theta_t - \theta^*\|^2$ that separates contraction (from strong convexity of $A$), bias (from the finite-length MCMC chains), and variance (from stochastic sampling).
- Controlling the MCMC bias by leveraging the divergence bound (A2) to relate the data-initialized chains to the model distribution, and the restricted spectral gap (A3) to contract the deviations of the relevant statistics under the $m$-step kernel.
- Handling batch correlations, especially in offline CD, by bounding empirical-process deviations via covering numbers under subexponential tails, or by Markov’s inequality under polynomial moments, yielding the parametric rate under suitable conditions.
- For averaged iterates, invoking Polyak–Ruppert-type arguments to achieve variance reduction and ensure asymptotic efficiency on par with standard SGD analyses (Glaser et al., 15 Oct 2025).
7. Practical Implications and Limitations
The main practical recommendations and boundaries are as follows:
- Regime for optimality: Achieving the parametric $O(n^{-1/2})$ rate and near–Cramér–Rao efficiency requires the number of MCMC steps per update to grow with the sample size (logarithmically, per Section 5) and the learning rate to decay appropriately with the iteration count (an illustrative schedule is sketched after this list).
- Spectral gap and mixing: The advantage of CD depends critically on the gap condition (A3) for MCMC kernels contracting the statistics of interest. If the target $p_\theta$ is highly multimodal or heavy-tailed and the MCMC kernel mixes slowly, the restricted spectral gap $\gamma$ is small, so a much larger number of MCMC steps $m$ may be required, or the condition may fail altogether.
- Model class constraints: The analysis covers (unnormalized) exponential families; general energy-based models whose energies are nonlinear in the parameters fall outside this setting and require more stringent assumptions and analysis.
- Finite-sample behavior: Constants such as those depending on the spectral gap, the smoothness and strong-convexity parameters, and the divergence bound may be unfavorable in practical, high-dimensional settings, affecting the observed rate and necessitating careful hyperparameter selection.
- Extensions and open directions: Open questions include relaxing the convexity/compactness requirements on $\Theta$, analysis of persistent CD, practical guidance on trading off the chain length $m$ against other algorithmic parameters, and extension to broader EBM settings.
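As a purely illustrative sketch of the regime described above (the constants and the specific $1/t$ decay are placeholder choices of ours, not recommendations from the paper), the hyperparameters might be set as follows:

```python
import numpy as np

def cd_schedule(n, c_m=2.0, c_eta=1.0, t0=10.0):
    """Illustrative CD hyperparameters: chain length growing logarithmically with the
    sample size n, and a step size decaying with the iteration count t.
    The constants c_m, c_eta, t0 are placeholders, not values from the cited analysis."""
    m = max(1, int(np.ceil(c_m * np.log(n))))   # MCMC steps per update
    step_size = lambda t: c_eta / (t0 + t)      # decaying learning rate
    return m, step_size

m, eta = cd_schedule(n=10_000)
```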
In summary, recent advances have demonstrated that, under mild structural and mixing conditions and with sufficiently long chains, Contrastive Divergence achieves the minimax parametric rate and an asymptotic variance within a factor 4 of the Cramér–Rao bound, justifying its use as a near-optimal training method for exponential-family unnormalized models (Glaser et al., 15 Oct 2025).