Contrastive Divergence: Algorithm & Analysis
- Contrastive Divergence is a stochastic approximation algorithm that uses short MCMC chains to estimate intractable gradients in unnormalized exponential-family models.
- It replaces the intractable expectation in the log-likelihood gradient with samples from m-step MCMC kernels, balancing computational efficiency with controlled bias and variance.
- Under regularity conditions with appropriately scaled MCMC steps and learning rates, CD achieves the parametric O(n⁻¹/²) convergence rate and asymptotic variance near the Cramér–Rao bound.
Contrastive Divergence (CD) is a widely used stochastic approximation algorithm for training unnormalized models—most notably Restricted Boltzmann Machines, general exponential-family graphical models, and contemporary neural energy-based models—by replacing the intractable expectation in the log-likelihood gradient with short Markov chains initialized at the observed data. The method is valued for its computational efficiency and strong empirical performance, and a nuanced theoretical picture of its bias, consistency, and statistical optimality has been developed to understand its properties.
1. Problem Formulation and Algorithmic Principle
Consider a minimal exponential family model with unnormalized density
$$\tilde p_\theta(x) = h(x)\, e^{\langle \theta, T(x)\rangle}, \qquad p_\theta(x) = \frac{\tilde p_\theta(x)}{Z(\theta)}, \qquad Z(\theta) = \int h(x)\, e^{\langle \theta, T(x)\rangle}\, dx,$$
where $T(x) \in \mathbb{R}^d$ is the vector of sufficient statistics, $\theta \in \Theta \subseteq \mathbb{R}^d$ is the natural parameter, $h$ is a known base measure, and $Z(\theta)$ is the partition function. Given i.i.d. samples $x_1, \dots, x_n$, the negative log-likelihood is (up to an additive constant independent of $\theta$)
$$L_n(\theta) = A(\theta) - \frac{1}{n}\sum_{i=1}^{n} \langle \theta, T(x_i)\rangle, \qquad A(\theta) = \log Z(\theta),$$
whose gradient is
$$\nabla L_n(\theta) = \mathbb{E}_{X \sim p_\theta}[T(X)] - \frac{1}{n}\sum_{i=1}^{n} T(x_i).$$
The model expectation $\mathbb{E}_{p_\theta}[T(X)]$ is typically intractable for high-dimensional models.
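For instance (an illustrative example of ours, not drawn from the cited paper), for a binary pairwise model on $x \in \{0,1\}^d$ with $T(x) = \big(x, (x_j x_k)_{j<k}\big)$, computing the expectation requires the partition function
$$Z(\theta) = \sum_{x \in \{0,1\}^d} e^{\langle \theta, T(x)\rangle}, \qquad \mathbb{E}_{p_\theta}[T(X)] = \frac{1}{Z(\theta)} \sum_{x \in \{0,1\}^d} T(x)\, e^{\langle \theta, T(x)\rangle},$$
a sum over $2^d$ configurations that is infeasible to evaluate exactly for moderate $d$.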
Contrastive Divergence (CD) replaces the intractable model expectation by running $m$-step MCMC kernels initialized at the data $x_i$. At iteration $t$, in online (single-sample) or offline (mini-batch) form, the algorithm proceeds as follows (a minimal code sketch follows the list):
- Draw a negative sample $\tilde X_t \sim K_{\theta_t}^{m}(x_{i_t}, \cdot)$, i.e., run $m$ steps of an MCMC kernel targeting $p_{\theta_t}$, started at a data point $x_{i_t}$.
- Compute the stochastic gradient $g_t = T(\tilde X_t) - T(x_{i_t})$ (averaged over the mini-batch in the offline case).
- Update: $\theta_{t+1} = \theta_t - \eta_t\, g_t$ with step size $\eta_t$ (projected back onto $\Theta$ if necessary).
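The following is a minimal, self-contained sketch of online CD-$m$ for a toy fully visible binary pairwise model; the model, the function names (`gibbs_sweep`, `cd_update`), the use of Gibbs sampling as the MCMC kernel, and all constants are illustrative assumptions of ours, not a specific setup from the cited work.

```python
# Sketch of online CD-m for a fully visible binary pairwise model
# p_theta(x) ∝ exp(x^T W x / 2 + b^T x), x in {0,1}^d, W symmetric with zero diagonal.
# Sufficient statistics: T(x) = (x x^T, x); the MCMC kernel is a systematic Gibbs sweep.
import numpy as np

def gibbs_sweep(x, W, b, rng):
    """One full Gibbs sweep: resample each coordinate from its conditional."""
    x = x.copy()
    for j in range(x.size):
        logit = W[j] @ x + b[j]                    # log-odds of x_j = 1 given the rest (diag(W) = 0)
        x[j] = float(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
    return x

def cd_update(W, b, x_data, m, lr, rng):
    """One online CD-m step from a single data vector x_data."""
    x_neg = x_data.copy()
    for _ in range(m):                             # m-step chain initialized at the data point
        x_neg = gibbs_sweep(x_neg, W, b, rng)
    # Stochastic gradient of the negative log-likelihood: T(negative sample) - T(data).
    gW = np.outer(x_neg, x_neg) - np.outer(x_data, x_data)
    np.fill_diagonal(gW, 0.0)                      # keep the diagonal of W fixed at zero
    gb = x_neg - x_data
    return W - lr * gW, b - lr * gb

# Toy usage: one pass over synthetic binary data with a 1/t step-size decay.
rng = np.random.default_rng(0)
d, n, m = 5, 2000, 5
data = (rng.random((n, d)) < 0.3).astype(float)
W, b = np.zeros((d, d)), np.zeros(d)
for t, x in enumerate(data, start=1):
    W, b = cd_update(W, b, x, m=m, lr=1.0 / t, rng=rng)
```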
2. Regularity Assumptions and Statistical Setup
The non-asymptotic analysis of CD makes the following technical assumptions:
- A1 (Regular exponential family): $\Theta \subset \mathbb{R}^d$ is convex and compact, and the log-partition function $A$ is strongly convex and smooth on $\Theta$: there are constants $0 < \mu \le L < \infty$ such that $\mu I \preceq \nabla^2 A(\theta) \preceq L I$ for all $\theta \in \Theta$.
- A2 (Divergence control): there exists a finite constant such that, for every $\theta \in \Theta$, the divergence between the data-generating distribution and the model $p_\theta$ (the quantity used to control the MCMC bias in Section 6) is uniformly bounded.
- A3 (Restricted spectral gap): the $m$-step MCMC kernel contracts the $L^2(p_\theta)$ deviations of the statistics entering the gradient; for such a statistic $f$,
$$\big\| K_\theta^{m} f - \mathbb{E}_{p_\theta}[f] \big\|_{L^2(p_\theta)} \le (1-\gamma)^{m} \big\| f - \mathbb{E}_{p_\theta}[f] \big\|_{L^2(p_\theta)},$$
where $\gamma \in (0,1]$ and $K_\theta$ is the MCMC transition operator.
These assumptions, together with smoothness of $A$ on $\Theta$, ensure control over the moments and the MCMC bias/variance of the relevant statistics.
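As a simple illustration of assumption A1 (an example of ours, not taken from the cited paper): for the Bernoulli family with $T(x) = x$ on $\{0,1\}$ and $A(\theta) = \log(1 + e^{\theta})$, one has
$$A''(\theta) = \sigma(\theta)\big(1 - \sigma(\theta)\big), \qquad \sigma(\theta) = \frac{1}{1 + e^{-\theta}},$$
so on a compact interval $\Theta = [-R, R]$ the second derivative lies between $\sigma(R)(1-\sigma(R)) > 0$ and $1/4$; $A$ is therefore strongly convex and smooth on $\Theta$, as A1 requires.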
3. Non-Asymptotic Convergence Rates
The main result establishes that, under A1–A3, CD achieves the parametric $O(n^{-1/2})$ rate for parameter estimation, matching maximum-likelihood estimation under regularity:
- The error bound is governed by an effective bias/variance constant determined by the strong-convexity and smoothness parameters of $A$, the divergence bound in A2, and the restricted spectral gap $\gamma$ in A3.
- For a suitably decaying step-size sequence and a sufficient number of MCMC steps $m$ per update, the mean-squared error of the iterate after processing $n$ samples satisfies $\mathbb{E}\,\|\theta_n - \theta^*\|^2 = O(1/n)$.
Therefore, $\|\theta_n - \theta^*\| = O_P(n^{-1/2})$ as $n \to \infty$ (Glaser et al., 15 Oct 2025).
This result closes the gap with previous analyses, which established only slower, non-parametric rates for batch CD under more restrictive assumptions (Jiang et al., 2016).
4. Batching Regimes: Online, Minibatch, and SGD Variants
The analysis applies to various data presentation and batching schemes:
- Fully online (batch size 1): The parametric rate holds directly, with the bias and variance decomposed through the stochastic updates.
- Minibatch/Offline CD: Minibatches of a fixed size are processed in several batches per epoch. Under mild additional moment conditions on the sufficient statistics and suitably decaying step sizes, an analogous mean-squared-error bound holds for the iterates after a given number of epochs, with additional terms that vanish as the number of epochs and batches grows. In the limit, the bound is of the parametric order for subexponential tails but is polynomially slower if only weaker (polynomial) moments are available (Glaser et al., 15 Oct 2025).
- With- and Without-Replacement Sampling: The asymptotic rate is unaffected: the parametric rate is attainable under both with-replacement (SGDw) and without-replacement (SGDo) sampling; the sketch below illustrates the two index-sampling schemes.
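A minimal sketch of the two mini-batch index-sampling schemes (variable names and batch size are illustrative, not taken from the cited paper); in either case, the CD gradient for a batch averages $T(\tilde X) - T(x)$ over the selected data points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 1000, 32

# With-replacement sampling (SGDw): each batch draws indices i.i.d. uniformly from the dataset,
# so a data point may appear in several batches of the same epoch.
batch_w = rng.integers(0, n, size=batch_size)

# Without-replacement sampling (SGDo): shuffle once per epoch, then walk through disjoint batches,
# so each data point is used exactly once per epoch.
perm = rng.permutation(n)
batches_o = [perm[k:k + batch_size] for k in range(0, n, batch_size)]
```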
5. Asymptotic Variance and Near-Optimality
Averaging the iterates with the Polyak–Ruppert scheme, $\bar\theta_n = \frac{1}{n}\sum_{t=1}^{n} \theta_t$ (with a suitably decaying step size and sufficiently many MCMC steps per update), achieves near-optimal asymptotic variance:
- If the number of MCMC steps grows at least logarithmically with the sample size, so that the residual MCMC bias is negligible, the asymptotic variance of the averaged iterates is bounded by $4\, I(\theta^*)^{-1}$,
where $I(\theta^*) = \nabla^2 A(\theta^*) = \mathrm{Cov}_{p_{\theta^*}}\!\big(T(X)\big)$ is the Fisher information (Glaser et al., 15 Oct 2025).
- The asymptotic variance is therefore within a factor 4 of the Cramér–Rao lower bound for fully-observed exponential families.
This establishes that, provided the MCMC chains are sufficiently long (scaling logarithmically with the sample size), CD is statistically near-optimal: its asymptotic variance is close to the Cramér–Rao benchmark for unbiased estimators.
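A minimal sketch of Polyak–Ruppert averaging wrapped around a generic stochastic update (the helper name `averaged_iterates` and its callable interface are illustrative assumptions of ours; any CD-style update, such as the `cd_update` sketched in Section 1, could be plugged in):

```python
import numpy as np

def averaged_iterates(theta0, update, n_steps):
    """Run theta <- update(theta, t) for n_steps and return the last and the averaged iterate."""
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()
    for t in range(1, n_steps + 1):
        theta = update(theta, t)
        theta_bar += (theta - theta_bar) / t   # running mean: theta_bar_t = (1/t) * sum_{s<=t} theta_s
    return theta, theta_bar
```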
6. Proof Structure and Technical Innovations
The theoretical analysis is built on several key ideas:
- Establishing a recursion for the mean-squared error $\mathbb{E}\|\theta_t - \theta^*\|^2$ that separates contraction (from strong convexity of $A$), bias (from the finite-length MCMC chains), and variance (from stochastic sampling).
- Controlling the MCMC bias by leveraging the divergence bound (A2) to relate the data-initialized chains to the model distribution, and the restricted spectral gap (A3) to contract the deviations of the relevant statistics under the $m$-step kernel.
- Handling batch correlations, especially in offline CD, by bounding empirical-process deviations via covering numbers under subexponential tails, or by Markov’s inequality under polynomial moments, yielding the parametric rate under suitable conditions.
- For averaged iterates, invoking Polyak–Ruppert-type arguments to achieve variance reduction and ensure asymptotic efficiency on par with standard SGD analyses (Glaser et al., 15 Oct 2025).
7. Practical Implications and Limitations
The main practical recommendations and boundaries are as follows:
- Regime for optimality: Achieving the parametric $O(n^{-1/2})$ rate and near–Cramér–Rao efficiency requires the number of MCMC steps per update to grow with the sample size (logarithmically, per Section 5) and the learning rate to decay appropriately with the iteration count (an illustrative schedule is sketched after this list).
- Spectral gap and mixing: The advantage of CD depends critically on the gap condition (A3) for MCMC kernels contracting the statistics of interest. If the target $p_\theta$ is highly multimodal or heavy-tailed and the MCMC kernel mixes slowly, the restricted spectral gap $\gamma$ is small, so a much larger number of MCMC steps $m$ may be required, or the condition may fail altogether.
- Model class constraints: The analysis covers (unnormalized) exponential families; general energy-based models whose energies are nonlinear in the parameters fall outside this setting and require more stringent assumptions and analysis.
- Finite-sample behavior: Constants such as those depending on the spectral gap, the smoothness and strong-convexity parameters, and the divergence bound may be unfavorable in practical, high-dimensional settings, affecting the observed rate and necessitating careful hyperparameter selection.
- Extensions and open directions: Open questions include relaxing the convexity/compactness requirements on $\Theta$, analysis of persistent CD, practical guidance on trading off the chain length $m$ against other algorithmic parameters, and extension to broader EBM settings.
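As a purely illustrative sketch of the regime described above (the constants and the specific $1/t$ decay are placeholder choices of ours, not recommendations from the paper), the hyperparameters might be set as follows:

```python
import numpy as np

def cd_schedule(n, c_m=2.0, c_eta=1.0, t0=10.0):
    """Illustrative CD hyperparameters: chain length growing logarithmically with the
    sample size n, and a step size decaying with the iteration count t.
    The constants c_m, c_eta, t0 are placeholders, not values from the cited analysis."""
    m = max(1, int(np.ceil(c_m * np.log(n))))   # MCMC steps per update
    step_size = lambda t: c_eta / (t0 + t)      # decaying learning rate
    return m, step_size

m, eta = cd_schedule(n=10_000)
```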
In summary, recent advances have demonstrated that, under mild structural and mixing conditions and with sufficiently long chains, Contrastive Divergence achieves the minimax parametric rate and an asymptotic variance within a factor 4 of the Cramér–Rao bound, justifying its use as a near-optimal training method for exponential-family unnormalized models (Glaser et al., 15 Oct 2025).