
Contrastive Divergence: Algorithm & Analysis

Updated 8 March 2026
  • Contrastive Divergence is a stochastic approximation algorithm that uses short MCMC chains to estimate intractable gradients in unnormalized exponential-family models.
  • It replaces the intractable expectation in the log-likelihood gradient with samples from m-step MCMC kernels, balancing computational efficiency with controlled bias and variance.
  • Under regularity conditions with appropriately scaled MCMC steps and learning rates, CD achieves the parametric O(n⁻¹/²) convergence rate and asymptotic variance near the Cramér–Rao bound.

Contrastive Divergence (CD) is a widely used stochastic approximation algorithm for training unnormalized models (most notably Restricted Boltzmann Machines, general exponential-family graphical models, and contemporary neural energy-based models) by replacing the intractable expectation in the log-likelihood gradient with short Markov chains initialized at the observed data. The method is notable for its computational efficiency and favorable empirical performance, and for the nuanced theory of bias, consistency, and statistical optimality that has been developed to explain its behavior.

1. Problem Formulation and Algorithmic Principle

Consider a minimal exponential family model with unnormalized density

$$p_\psi(dx) = \exp\{\psi^\top \phi(x) - \log Z(\psi)\}\, c(dx)$$

where $\phi(x)$ is the vector of sufficient statistics, $\psi \in \Psi \subset \mathbb{R}^p$ is the natural parameter, $c(\cdot)$ is a known base measure, and $Z(\psi)$ is the partition function. Given $n$ i.i.d. samples $X_1, \dots, X_n \sim p_{\psi^*}$, the negative log-likelihood is

$$L(\psi) = -\frac{1}{n}\sum_{i=1}^n \phi(X_i)^\top \psi + \log Z(\psi)$$

whose gradient is

$$\nabla L(\psi) = -\frac{1}{n}\sum_{i=1}^n \phi(X_i) + \mathbb{E}_{X \sim p_\psi}[\phi(X)]$$

The expectation $\mathbb{E}_{p_\psi}[\phi(X)]$ is typically intractable for high-dimensional models.
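For concreteness, the following toy sketch (an illustration constructed here, not code from the cited work) instantiates this family on two spins $x \in \{-1,+1\}^2$ with $\phi(x) = (x_1, x_2, x_1 x_2)$, small enough that $Z(\psi)$ and the model expectation are exactly computable by enumeration, so the exact gradient is available for reference:

```python
# Toy exponential family (illustration only): two spins in {-1,+1}^2
# with phi(x) = (x1, x2, x1*x2); Z(psi) is tractable by enumeration.
import itertools
import numpy as np

states = np.array(list(itertools.product([-1, 1], repeat=2)))  # all 4 configurations

def phi_vec(x):
    """Sufficient statistics phi(x) = (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]], dtype=float)

PHI = np.array([phi_vec(s) for s in states])

def model_expectation(psi):
    """Exact E_{p_psi}[phi(X)] via the normalized Gibbs distribution."""
    logits = PHI @ psi
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs @ PHI

def nll_gradient(psi, data):
    """grad L(psi) = -(1/n) sum_i phi(X_i) + E_{p_psi}[phi(X)]."""
    data_phi = np.array([phi_vec(x) for x in data])
    return -data_phi.mean(axis=0) + model_expectation(psi)
```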

Contrastive Divergence (CD) replaces the intractable model expectation by running $m$-step MCMC kernels $k_\psi^{(m)}$ initialized at the data $X_i$; for iteration $t$, in online (single-sample) or offline (mini-batch) form (a runnable sketch on the toy model above follows the list):

  • Draw $\tilde X_t^m \sim k_\psi^{(m)}(X_t, \cdot)$
  • Compute $h_t = \phi(X_t) - \phi(\tilde X_t^m)$
  • Update: $\psi_t = \mathrm{Proj}_\Psi\left[\psi_{t-1} - \eta_t h_t\right]$
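A minimal sketch of the CD-$m$ loop on the toy model above, using a systematic-scan Gibbs sweep as the kernel $k_\psi$; the projection onto $\Psi$ is omitted because the toy parameter space is taken to be all of $\mathbb{R}^3$:

```python
def gibbs_sweep(x, psi, rng):
    """One systematic-scan Gibbs sweep for the two-spin toy model:
    p(x_i = +1 | x_j) = sigmoid(2 * (psi_i + psi_3 * x_j))."""
    x = x.copy()
    for i in range(2):
        field = psi[i] + psi[2] * x[1 - i]          # local field on spin i
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
        x[i] = 1 if rng.random() < p_plus else -1
    return x

def cd_step(psi, x_data, m, eta, rng):
    """One online CD-m update: h_t = phi(X_t) - phi(X~_t^m), chain started at the data."""
    x_neg = x_data
    for _ in range(m):
        x_neg = gibbs_sweep(x_neg, psi, rng)
    return psi - eta * (phi_vec(x_data) - phi_vec(x_neg))

# Example: one online pass of CD-1 with eta_t = 1/t.
# rng = np.random.default_rng(0)
# psi = np.zeros(3)
# for t, x in enumerate(data, start=1):
#     psi = cd_step(psi, x, m=1, eta=1.0 / t, rng=rng)
```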

2. Regularity Assumptions and Statistical Setup

The non-asymptotic analysis of CD makes the following technical assumptions:

  • A1 (Regular exponential family): $\Psi$ is convex and compact, $\psi^* \in \mathrm{int}\,\Psi$, and $L$ is $\mu$-strongly convex and $L$-smooth: $0 < \mu \le \lambda_{\min}(\nabla^2 L(\psi)) \le \lambda_{\max}(\nabla^2 L(\psi)) \le L < \infty$ for all $\psi \in \Psi$.
  • A2 ($\chi^2$-control): There exists $C_\chi < \infty$ such that for all $\psi \in \Psi$,

$$\chi^2(p_{\psi^*}, p_\psi) = \int \left( \frac{dp_{\psi^*}}{dp_\psi} - 1 \right)^2 p_\psi(dx) \le C_\chi^2 \|\psi - \psi^*\|^2$$

  • A3 (Restricted spectral gap): The MCMC kernel $k_\psi$ contracts, at each step, the $L^2$-norms of $\phi$ and $\phi \otimes \phi$:

$$\alpha := \sup_{\psi \in \Psi,\; f \in \{\phi_i,\, \phi_i \phi_j\}} \alpha(f, \psi) < 1$$

where $\alpha(f, \psi) = \|P_\psi f - \mathbb{E}_{p_\psi}[f]\|_{L^2} / \|f - \mathbb{E}_{p_\psi}[f]\|_{L^2}$ and $P_\psi$ is the MCMC transition operator.

These assumptions, together with analytic smoothness of $\log Z$ on $\Psi$, ensure control over moments and MCMC bias/variance for the relevant statistics.
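The contraction coefficient in A3 can be probed empirically for a given kernel. The sketch below (our own illustration, reusing `gibbs_sweep` from the toy model above; not from the cited work) estimates $\alpha(f, \psi)$ by Monte Carlo, approximating $P_\psi f(x) = \mathbb{E}[f(X') \mid X = x]$ with repeated one-step transitions; taking the max over the components of $\phi$ and their pairwise products approximates the sup in A3:

```python
def estimate_alpha(f, psi, n_samples=2000, n_inner=200, burn=500, seed=0):
    """Monte Carlo estimate of alpha(f, psi) for one Gibbs sweep on the
    two-spin toy model (illustration only; inner-loop noise biases the
    numerator slightly upward)."""
    rng = np.random.default_rng(seed)
    x = np.array([1, 1])
    for _ in range(burn):                        # burn in towards p_psi
        x = gibbs_sweep(x, psi, rng)
    fx, pfx = [], []
    for _ in range(n_samples):
        x = gibbs_sweep(x, psi, rng)             # x approximately ~ p_psi
        fx.append(f(x))
        # P_psi f(x), estimated by averaging f over n_inner one-step draws
        pfx.append(np.mean([f(gibbs_sweep(x, psi, rng)) for _ in range(n_inner)]))
    fx, pfx = np.array(fx), np.array(pfx)
    mu_f = fx.mean()
    return np.sqrt(np.mean((pfx - mu_f) ** 2) / np.mean((fx - mu_f) ** 2))

# Example: contraction of the interaction statistic x1*x2.
# alpha_hat = estimate_alpha(lambda x: x[0] * x[1], psi=np.array([0.2, -0.1, 0.3]))
```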

3. Non-Asymptotic Convergence Rates

The main result establishes that, under A1–A3, CD achieves the parametric $O(n^{-1/2})$ rate for parameter estimation, matching maximum-likelihood estimation under regularity:

  • Define

$$\tilde\mu_m = \mu - \alpha^m \sigma C_\chi, \qquad \tilde\sigma_m^2 = \sigma_*^2 + \sigma^2 + 2\sigma^2 \alpha^{2m}$$

where $\sigma_*^2 = \mathrm{Var}_{p_{\psi^*}}[\phi(X)]$ and $\sigma^2 = \sup_{\psi \in \Psi} \mathrm{Var}_{p_\psi}[\phi(X)]$.

  • For step size $\eta_t = C t^{-1}$ with $C > 2/\tilde\mu_m$, and for $m > \log(\sigma C_\chi / \mu)/|\log \alpha|$ (which guarantees $\tilde\mu_m > 0$), the mean-squared error satisfies

$$\mathbb{E}\|\psi_n - \psi^*\|^2 \le \frac{4 C^2 \tilde\sigma_m^2}{\tilde\mu_m(\tilde\mu_m C - 2)}\, n^{-1} + o(n^{-1})$$

Therefore $\|\psi_n - \psi^*\| = O(n^{-1/2})$ in mean square as $n \to \infty$ (Glaser et al., 15 Oct 2025). A numerical illustration of how these constants interact follows below.
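The sketch below plugs invented values of $\mu$, $\sigma_*^2$, $\sigma^2$, $C_\chi$, and $\alpha$ (illustrative only, not from the paper) into the definitions above; note how large the leading constant can become even in a benign setting, echoing the finite-$n$ caveats of Section 7:

```python
import math

# Illustrative constants (invented for this sketch, not from the paper).
mu = 0.5                         # strong convexity (A1)
sigma_star2, sigma2 = 1.0, 1.5   # Var under psi*, worst case over Psi
C_chi, alpha = 2.0, 0.8          # chi^2 constant (A2), contraction (A3)
sigma = math.sqrt(sigma2)

# Minimal m making mu_tilde_m > 0: m > log(sigma*C_chi/mu) / |log alpha|
m = math.ceil(math.log(sigma * C_chi / mu) / abs(math.log(alpha))) + 1

mu_tilde = mu - alpha**m * sigma * C_chi
sigma_tilde2 = sigma_star2 + sigma2 + 2 * sigma2 * alpha**(2 * m)

C = 2.5 / mu_tilde               # step size eta_t = C/t, with C > 2/mu_tilde
mse_const = 4 * C**2 * sigma_tilde2 / (mu_tilde * (mu_tilde * C - 2))
print(f"m = {m}: mu_tilde = {mu_tilde:.3f}, leading MSE ~ {mse_const:.1f}/n")
```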

This result closes the gap with previous analyses, which established only an $O(n^{-1/3})$ rate for batch CD under more restrictive assumptions (Jiang et al., 2016).

4. Batching Regimes: Online, Minibatch, and SGD Variants

The analysis applies to various data presentation and batching schemes:

  • Fully online (batch size $B = 1$): The $O(n^{-1/2})$ rate holds directly, with bias and variance decomposed across the stochastic updates.
  • Minibatch/offline CD: Minibatches $B_{t,j}$ of size $B$ are processed in $N = \lceil n/B \rceil$ batches per epoch (a sketch of this scheme follows the list). With mild additional moment conditions ($\nu > 2$ moments on $\phi(k_\psi^m(X))$) and step sizes $\eta_t = C t^{-\beta}$, $0 < \beta \le 1$, one obtains for the iterate $\psi_{T,N}$ after $T$ epochs of $N$ batches:

$$\sqrt{\mathbb{E}\|\psi_{T,N} - \psi^*\|^2} \le E_1^{T,N}\sqrt{\delta_0} + \frac{C\,\sigma_{n,T}}{\tilde\mu_m} \times O(1)$$

with $\sigma_{n,T} = \epsilon_{n,m,T} + O(B^{-1/2})$ and $E_1^{T,N}$ vanishing as $T$ increases. As $T \to \infty$, the bound reduces to order $(\log n)^{1/2} n^{-1/2}$ for subexponential tails, but is polynomially slower if only weaker moments are available (Glaser et al., 15 Oct 2025).

  • With- and without-replacement sampling: The asymptotic rate is unaffected: the $(\log n)^{1/2} n^{-1/2}$ rate is attainable under both with-replacement (SGDw) and without-replacement (SGDo) sampling.
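A minimal sketch of one offline epoch with without-replacement (SGDo) minibatches, reusing `gibbs_sweep` and `phi_vec` from the toy model above; `data` is assumed to be an `(n, 2)` integer array of spins, and for SGDw one would draw batch indices with `rng.integers` instead of permuting:

```python
def cd_minibatch_epoch(psi, data, B, m, eta_fn, t0, rng):
    """One epoch of offline CD-m with without-replacement minibatches
    (toy-model sketch, not the paper's reference implementation)."""
    perm = rng.permutation(len(data))            # SGDo: reshuffle once per epoch
    t = t0
    for start in range(0, len(data), B):
        batch = data[perm[start:start + B]]
        h = np.zeros_like(psi)
        for x in batch:                          # average the CD gradient over the batch
            x_neg = x
            for _ in range(m):
                x_neg = gibbs_sweep(x_neg, psi, rng)
            h += phi_vec(x) - phi_vec(x_neg)
        psi = psi - eta_fn(t) * h / len(batch)   # e.g. eta_fn = lambda t: C * t**(-beta)
        t += 1
    return psi, t
```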

5. Asymptotic Variance and Near-Optimality

Averaging the iterates with the Polyak–Ruppert scheme, i.e., $\bar\psi_n = \frac{1}{n}\sum_{t=1}^n \psi_t$ (with $\eta_t = C t^{-\beta}$, $\beta \in (1/2, 1)$), achieves near-optimal asymptotic variance:

  • If $m = O(\log n)$ (specifically, $m > \frac{1-\beta}{2|\log \alpha|} \log n$), then

$$\sqrt{\mathbb{E}\|\bar\psi_n - \psi^*\|^2} \le 2\sqrt{\frac{\mathrm{tr}\left(I(\psi^*)^{-1}\right)}{n}} + o(n^{-1/2})$$

where $I(\psi^*) = \mathrm{Cov}_{p_{\psi^*}}[\phi(X)]$ is the Fisher information (Glaser et al., 15 Oct 2025).

  • The asymptotic variance is therefore within a factor of 4 of the Cramér–Rao lower bound for fully observed exponential families.

This establishes that, provided the MCMC chains are sufficiently long (scaling logarithmically with $n$), CD is statistically near-optimal, approaching the Cramér–Rao benchmark that bounds any unbiased estimator.
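A sketch of online CD with Polyak–Ruppert averaging on the toy model (illustration only), with $\eta_t = C t^{-\beta}$ for $\beta \in (1/2, 1)$ and $m$ chosen of order $\log n$ as the theory prescribes:

```python
def cd_online_averaged(data, m, C=1.0, beta=0.7, seed=0):
    """Online CD-m with Polyak-Ruppert averaging (toy-model sketch):
    returns the running average of the iterates, not the last iterate."""
    rng = np.random.default_rng(seed)
    psi = np.zeros(3)
    psi_bar = np.zeros_like(psi)
    for t, x in enumerate(data, start=1):
        x_neg = x
        for _ in range(m):
            x_neg = gibbs_sweep(x_neg, psi, rng)
        psi = psi - C * t ** (-beta) * (phi_vec(x) - phi_vec(x_neg))
        psi_bar += (psi - psi_bar) / t           # running mean of psi_1..psi_t
    return psi_bar
```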

6. Proof Structure and Technical Innovations

The theoretical analysis is built on several key ideas:

  • Establishing a recursion for $\delta_t = \mathbb{E}\|\psi_t - \psi^*\|^2$ that separates contraction (from strong convexity), bias (finite-$m$ MCMC), and variance (stochastic sampling); a schematic version follows this list.
  • Controlling the MCMC bias by leveraging the $\chi^2$-divergence bound (A2) to relate $\mathbb{E}_{p_{\psi^*}}[f] - \mathbb{E}_{p_\psi}[f]$ to $\|\psi - \psi^*\|$, and the spectral gap (A3) to contract deviations of $\phi$ under $k_\psi^m$.
  • Handling batch correlations, especially in offline CD, by bounding empirical process deviations via covering numbers for subexponential tails, or by Markov's inequality for polynomial moments, yielding $(\log n)^{1/2} n^{-1/2}$ rates under suitable conditions.
  • For averaged iterates, invoking Polyak–Ruppert averaging to achieve variance reduction and asymptotic efficiency on par with classical SGD analyses (Glaser et al., 15 Oct 2025).
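A schematic form of that recursion, under the standard SGD-style decomposition (a sketch; the exact remainder terms and constants in the paper may differ), is

$$\delta_t \le \left(1 - 2\eta_t \tilde\mu_m\right)\delta_{t-1} + \eta_t^2\,\tilde\sigma_m^2 + \text{higher-order terms}$$

where the finite-$m$ MCMC bias enters through the degraded contraction constant $\tilde\mu_m = \mu - \alpha^m \sigma C_\chi$ of Section 3; unrolling the recursion with $\eta_t = C t^{-1}$ then yields the $O(n^{-1})$ mean-squared-error bound stated there.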

7. Practical Implications and Limitations

The main practical recommendations and limitations are as follows:

  • Regime for optimality: Achieving $O(n^{-1/2})$ rates and near-Cramér-Rao efficiency requires $m = \Omega(\log n)$ MCMC steps and learning rates decaying as $\eta_t \sim c/t$.
  • Spectral gap and mixing: The advantage of CD depends critically on the gap condition (A3), i.e., on the MCMC kernel contracting the statistics of interest. If $p_\psi$ is highly multimodal or heavy-tailed and the kernel mixes slowly, then $\alpha \approx 1$, so a much larger $m$ may be required or the condition may fail outright (the sketch after this list illustrates how $\alpha \to 1$ inflates the required $m$).
  • Model class constraints: The analysis covers (unnormalized) exponential families; general energy-based models, whose energies are nonlinear in the parameters, require more stringent assumptions and analysis.
  • Finite-$n$ behavior: Constants such as $C_\chi$, $\sigma$, $\alpha$, $\mu$, and $L$ may be unfavorable in practical, high-dimensional settings, degrading the observed rate and necessitating careful hyperparameter selection.
  • Extensions and open directions: Open questions include relaxing the convexity and compactness requirements on $\Psi$, the analysis of persistent CD, implementation guidelines for trading off $m$ against $\eta_t$, and extensions to broader EBM settings.
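A small helper (hypothetical, for illustration only) that translates the logarithmic chain-length prescription $m > \frac{1-\beta}{2|\log \alpha|} \log n$ from Section 5 into a concrete $m$, given an estimate of the per-step contraction $\alpha$; the example output shows how slow mixing ($\alpha$ near 1) inflates the required chain length:

```python
import math

def suggest_chain_length(n, alpha_hat, beta=0.7):
    """Chain length satisfying m > (1 - beta) / (2 |log alpha|) * log n,
    per the averaged-iterate analysis; alpha_hat is an (estimated)
    per-step contraction factor. A sketch, not an official guideline."""
    return math.ceil((1 - beta) / (2 * abs(math.log(alpha_hat))) * math.log(n)) + 1

# Slow mixing sharply inflates m: prints roughly 4, 18, and 173.
for a in (0.5, 0.9, 0.99):
    print(a, suggest_chain_length(n=100_000, alpha_hat=a))
```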

In summary, recent advances have demonstrated that, under mild structural and mixing conditions and with sufficiently long chains, Contrastive Divergence achieves the minimax parametric rate $O(n^{-1/2})$ and an asymptotic variance within a small constant factor of the Cramér-Rao bound, justifying its use as a near-optimal training method for exponential-family unnormalized models (Glaser et al., 15 Oct 2025).
