
Score-Matching in Semi-Implicit VI

Updated 24 January 2026
  • The paper introduces a score-matching approach (SIVI-SM) that directly optimizes the score function for semi-implicit variational distributions.
  • It replaces the conventional KL divergence with Fisher divergence, enabling tractable, MCMC-free training despite intractable marginal densities.
  • Empirical and theoretical extensions, including hierarchical and kernelized methods, demonstrate improved mode capture, accelerated diffusion sampling, and convergence guarantees.

Score-matching approaches to semi-implicit variational inference (SIVI-SM) constitute a principled methodology for training semi-implicit variational families by leveraging the Fisher (score) divergence instead of the conventional Kullback-Leibler (KL) divergence or surrogate evidence lower bounds (ELBOs). SIVI-SM directly optimizes over the score function, exploiting the hierarchical or semi-implicit construction of the variational distributions. The approach eliminates the need for tractable densities of marginals and enables scalable, MCMC-free training for expressive approximations of complicated posteriors. Hierarchical and kernelized extensions further expand SIVI-SM’s flexibility and computational efficiency.

1. Semi-Implicit Variational Inference and Intractable Densities

Traditional variational inference restricts the variational family $q_\varphi(z)$ to tractable, explicit distributions, optimizing

$$\mathrm{ELBO}(\varphi) = \mathbb{E}_{z\sim q_\varphi}\bigl[\log p(x,z) - \log q_\varphi(z)\bigr].$$

SIVI generalizes this by introducing a “mixing” variable $\varepsilon$, yielding a semi-implicit family,

$$q_\varphi(z) = \int q_\phi(z\mid\varepsilon) \, q_\xi(\varepsilon) \, d\varepsilon, \quad \varphi=(\phi,\xi),$$

where either the prior $q_\xi$ or the conditional $q_\phi(z\mid\varepsilon)$ is chosen to be implicit, and at least one distribution remains explicit and reparameterizable. Unless conjugacies exist, the marginal $q_\varphi(z)$ and its log-density are intractable, precluding standard ELBO gradients. Prior strategies circumvent this by using multi-sample surrogate bounds or expensive inner-loop MCMC, which introduces bias or large computational overhead (Yu et al., 2023).
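A minimal sketch of the semi-implicit construction may help fix ideas. It assumes a standard normal mixing distribution and a Gaussian conditional whose mean is a nonlinear function of the mixing variable; the names `mixing_map`, `sample_semi_implicit`, and `log_conditional` and the particular map are hypothetical choices for illustration, not taken from the cited papers. Sampling and the conditional log-density are cheap, while the marginal log-density would require the integral above.

```python
import math
import random

def mixing_map(eps, w):
    # Hypothetical nonlinear mean function of the conditional q(z | eps).
    return 2.0 * math.tanh(w * eps)

def sample_semi_implicit(n, phi=(0.5, 0.3), seed=0):
    """Draw n pairs (eps, z) with eps ~ N(0, 1) and z | eps ~ N(mixing_map(eps), sigma^2)."""
    w, sigma = phi
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)                            # mixing variable
        z = mixing_map(eps, w) + sigma * rng.gauss(0.0, 1.0)  # reparameterized draw
        pairs.append((eps, z))
    return pairs

def log_conditional(z, eps, phi=(0.5, 0.3)):
    """Explicit log-density log q(z | eps); the marginal log q(z) has no such closed form here."""
    w, sigma = phi
    resid = (z - mixing_map(eps, w)) / sigma
    return -0.5 * resid ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
```

Discarding `eps` from each pair leaves samples from the marginal $q_\varphi(z)$, which is exactly the situation where ELBO gradients are unavailable but sampling is easy.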

2. Fisher Divergence and Score-Matching Objective

SIVI-SM replaces the KL divergence between variational and target densities with the Fisher (score) divergence:

$$J(\varphi) = \mathbb{E}_{z\sim q_\varphi}\left\|\,\nabla_z\log q_\varphi(z) - \nabla_z\log p(x,z)\,\right\|^2_2.$$

Hyvärinen’s continuous score-matching formulation classically requires trace computations of vector fields, but SIVI-SM leverages denoising score matching and the structure of the hierarchy to obtain a tractable expectation. The key adaptation is the absence of explicit data corruption; instead, SIVI-SM expresses the score of the marginal through the mixture structure, as an average of conditional scores weighted by the posterior of $\varepsilon$ given $z$:

$$\nabla_z\log q_\varphi(z) = \frac{1}{q_\varphi(z)}\int q_\xi(\varepsilon)\, q_\phi(z\mid\varepsilon)\, \nabla_z \log q_\phi(z\mid\varepsilon)\, d\varepsilon.$$

All expectations are taken over samples from the generative hierarchy, ensuring scalability and avoiding separate MCMC estimation of gradients (Yu et al., 2023).
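The mixture identity for the marginal score can be checked numerically in a conjugate toy case. The sketch below assumes $\varepsilon \sim \mathcal{N}(0,1)$ and $z \mid \varepsilon \sim \mathcal{N}(a\varepsilon, \sigma^2)$, so that the marginal is $\mathcal{N}(0, a^2+\sigma^2)$ with score $-z/(a^2+\sigma^2)$; the function names and the trapezoid-rule quadrature are illustrative choices, not part of SIVI-SM itself.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def marginal_score_via_mixture(z, a=1.0, sigma=1.0, half_width=10.0, n=4001):
    """Evaluate (1/q(z)) * integral of q(eps) q(z|eps) d/dz log q(z|eps) by the trapezoid rule."""
    h = 2.0 * half_width / (n - 1)
    num = 0.0  # integral of joint density times conditional score
    den = 0.0  # integral of joint density, i.e. the marginal q(z)
    for i in range(n):
        eps = -half_width + i * h
        weight = h * (0.5 if i in (0, n - 1) else 1.0)
        joint = normal_pdf(eps, 0.0, 1.0) * normal_pdf(z, a * eps, sigma)
        cond_score = -(z - a * eps) / sigma ** 2
        num += weight * joint * cond_score
        den += weight * joint
    return num / den

def marginal_score_exact(z, a=1.0, sigma=1.0):
    """Closed-form score of the Gaussian marginal N(0, a^2 + sigma^2)."""
    return -z / (a * a + sigma * sigma)
```

In the general semi-implicit case neither the integral nor $q_\varphi(z)$ is available, which is why SIVI-SM avoids evaluating this expression directly and works with conditional scores inside an expectation instead.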

3. Minimax Formulation and Critic Network

To bypass intractable marginal scores, SIVI-SM recasts the training in a minimax formulation, with the expectation taken jointly over the mixing variable and the conditional sample:

$$\min_{\varphi=(\phi,\xi)} \; \max_{f} \; \mathbb{E}_{\varepsilon\sim q_\xi,\; z\sim q_\phi(\cdot\mid\varepsilon)} \left[ 2 f(z)^\top \bigl( \nabla_z \log p(x,z) - \nabla_z \log q_\phi(z\mid\varepsilon) \bigr) - \|f(z)\|^2 \right].$$

Here, $f$ (the critic) can be a neural network, targeting optimality at $f^*(z) = \nabla_z \log p(x,z) - \nabla_z \log q_\varphi(z)$. At this optimal critic, the objective reduces to the Fisher divergence $J(\varphi)$. All relevant expectations are tractable via hierarchical reparameterization. The SIVI-SM training algorithm alternates between updating the variational parameters and the critic’s parameters by stochastic optimization, using samples generated via the semi-implicit construction (Yu et al., 2023).
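The step from conditional scores to the Fisher divergence can be written out as a short derivation. This is a sketch under the assumption that $f$ ranges over all square-integrable vector fields; the shorthand $r(z)$ for the score residual is introduced here for convenience.

```latex
\begin{align*}
\mathbb{E}_{\varepsilon\sim q_\xi,\, z\sim q_\phi(\cdot\mid\varepsilon)}
  \bigl[ f(z)^\top \nabla_z \log q_\phi(z\mid\varepsilon) \bigr]
&= \mathbb{E}_{z\sim q_\varphi}
  \bigl[ f(z)^\top\, \mathbb{E}_{\varepsilon\mid z}[\nabla_z \log q_\phi(z\mid\varepsilon)] \bigr]
 = \mathbb{E}_{z\sim q_\varphi}
  \bigl[ f(z)^\top \nabla_z \log q_\varphi(z) \bigr], \\
\text{so, with } r(z) = \nabla_z \log p(x,z) - \nabla_z \log q_\varphi(z): \quad
\mathbb{E}\bigl[ 2 f^\top r - \|f\|^2 \bigr]
&= \mathbb{E}_{z\sim q_\varphi}\bigl[ \|r(z)\|^2 - \|f(z) - r(z)\|^2 \bigr], \\
\max_f \; \mathbb{E}\bigl[ 2 f^\top r - \|f\|^2 \bigr]
&= \mathbb{E}_{z\sim q_\varphi} \|r(z)\|^2 = J(\varphi),
\quad \text{attained at } f^* = r .
\end{align*}
```

The first line uses the mixture identity for the marginal score; the second completes the square in $f$.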

Pseudocode Outline (SIVI-SM Minimax)

  • Sample mixing variables and conditional noise
  • Generate $z$ samples using hierarchical reparameterization
  • Compute target scores $\nabla_z \log p(x,z)$ and conditional scores $\nabla_z\log q_\phi(z\mid\varepsilon)$
  • Update $\varphi$ (via minimization) and $f$ (via maximization) using stochastic gradients
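The outline above can be sketched on a one-dimensional toy problem with hand-written gradients. Everything specific here is an assumption made for illustration: the target is $\mathcal{N}(m, 2)$, the variational family is $\varepsilon \sim \mathcal{N}(0,1)$, $z \mid \varepsilon \sim \mathcal{N}(\mu + \varepsilon, 1)$ with only $\mu$ learned, and the critic is a single learnable constant $f(z) = b$. A constant critic suffices in this toy because the variances are matched, so the true score residual does not depend on $z$; a practical implementation would use a neural network and automatic differentiation.

```python
import random

def train_sivi_sm_toy(target_mean=1.0, steps=2000, batch=64, lr=0.05, seed=0):
    """Alternating stochastic updates on the SIVI-SM minimax objective (toy, 1D)."""
    rng = random.Random(seed)
    a, sigma = 1.0, 1.0                  # fixed: z | eps ~ N(mu + a*eps, sigma^2)
    target_var = a * a + sigma * sigma   # target variance matched to the marginal's
    mu = -1.0                            # variational parameter (minimized over)
    b = 0.0                              # constant critic f(z) = b (maximized over)
    for _ in range(steps):
        grad_b = 0.0
        grad_mu = 0.0
        for _ in range(batch):
            eps = rng.gauss(0.0, 1.0)                     # mixing variable
            z = mu + a * eps + sigma * rng.gauss(0.0, 1.0)  # reparameterized sample
            target_score = -(z - target_mean) / target_var
            cond_score = -(z - mu - a * eps) / sigma ** 2
            # Per-sample objective: l = 2*b*(target_score - cond_score) - b^2
            grad_b += 2.0 * (target_score - cond_score) - 2.0 * b
            # d l / d mu through z: the target score has slope -1/target_var,
            # while the reparameterized conditional score does not depend on mu.
            grad_mu += 2.0 * b * (-1.0 / target_var)
        b += lr * grad_b / batch     # gradient ascent on the critic
        mu -= lr * grad_mu / batch   # gradient descent on the variational parameter
    return mu, b
```

Calling `train_sivi_sm_toy()` returns the learned `mu` and critic value `b`; in this toy the variational mean should drift toward `target_mean` while the critic shrinks toward zero, since the score residual vanishes once the two Gaussians coincide.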

4. Hierarchical and Layer-wise Score-Matching

In complex settings where a single semi-implicit layer is insufficiently expressive, Hierarchical SIVI (HSIVI) stacks multiple conditional sampling layers:

  • $q_T(z_T)$: explicit base
  • $q_{T-1}(z_{T-1}\mid z_T; \phi_{T-1}), \ldots, q_0(z_0\mid z_1; \phi_0)$: conditional sampling layers

The marginal at each layer $t$ is recursively defined as

$$q_t(z_t; \phi_{\geq t}) = \int q_t(z_t \mid z_{t+1}; \phi_t) \; q_{t+1}(z_{t+1}; \phi_{\geq t+1}) \, dz_{t+1}.$$

To facilitate optimization, HSIVI utilizes a bridging sequence $p_0\leftarrow p_1\leftarrow \cdots\leftarrow p_{T-1} = p_{\text{base}}$, with each bridge’s score $s_{p_t}$ known. Each layer marginal $q_t$ is matched to its corresponding auxiliary distribution $p_t$ by minimizing the Fisher divergence

$$\mathcal{D}_{\mathrm{Fisher}}(p_t \| q_t) = \mathbb{E}_{q_t} \| s_{p_t}(z_t) - \nabla_{z_t} \log q_t(z_t) \|^2.$$

A minimax reformulation is applied per-layer, optimizing both network parameters and auxiliary critics, or globally with parameter sharing for efficient scaling (Yu et al., 2023).
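The recursive marginal definition in this section corresponds to ancestral sampling from the base layer down to $z_0$. The sketch below assumes linear-Gaussian conditionals $z_t \mid z_{t+1} \sim \mathcal{N}(\alpha_t z_{t+1}, \sigma_t^2)$, chosen only because the recursive integral then has a closed form; HSIVI layers are in general neural conditionals, and the function names are hypothetical.

```python
import random

def sample_hierarchy(alphas, sigmas, base_std=1.0, seed=0):
    """Ancestral sampling z_T -> z_{T-1} -> ... -> z_0; returns [z_T, ..., z_0].

    alphas and sigmas list the layer parameters from t = T-1 down to t = 0.
    """
    rng = random.Random(seed)
    z = rng.gauss(0.0, base_std)          # z_T from the explicit base
    path = [z]
    for alpha, sigma in zip(alphas, sigmas):
        z = alpha * z + sigma * rng.gauss(0.0, 1.0)   # draw z_t given z_{t+1}
        path.append(z)
    return path

def marginal_variances(alphas, sigmas, base_std=1.0):
    """Closed-form variance of each layer marginal under the linear-Gaussian assumption."""
    var = base_std ** 2
    out = [var]
    for alpha, sigma in zip(alphas, sigmas):
        var = alpha ** 2 * var + sigma ** 2   # the recursive integral, evaluated exactly
        out.append(var)
    return out
```

With nonlinear layers the variance recursion has no closed form, but the sampling loop is unchanged, which is what makes the layer marginals easy to sample yet hard to evaluate.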

5. Practical Algorithms and Diffusion-Score Acceleration

Two principal training procedures exist:

  1. Sequential (Layerwise) Training: Each $q_t$ and critic $u_t$ is updated until local convergence before proceeding upward in the hierarchy.
  2. Joint (Parameter-Sharing) Training: All $q_t$ and critics share weights. One samples batch indices across layers, optimizes the joint SM objective weighted by layer-specific coefficients, and updates parameters via stochastic gradients.
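The joint procedure can be illustrated by a Monte Carlo estimate of a weighted sum of per-layer objectives. This sketch assumes layers are sampled uniformly and that each per-layer objective is available as a zero-argument callable; both are illustrative assumptions, since the sampling distribution and weighting coefficients are design choices not specified here.

```python
import random

def joint_loss_estimate(layer_losses, weights, batch_size, rng):
    """Estimate sum_t weights[t] * E[layer_losses[t]()] by sampling layer indices uniformly."""
    num_layers = len(layer_losses)
    total = 0.0
    for _ in range(batch_size):
        t = rng.randrange(num_layers)   # sample a layer index
        # Multiplying by num_layers undoes the 1/num_layers sampling probability.
        total += num_layers * weights[t] * layer_losses[t]()
    return total / batch_size
```

Because every layer shares parameters in the joint scheme, a single stochastic gradient step on this estimate updates all layers at once.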

When applied to diffusion models, the auxiliary bridging sequence is constructed from a schedule of intermediate distributions (geometric interpolates or SDE marginals), and pre-trained diffusion scores $s_\theta(x,s)$ are directly injected for efficient score estimation. The resulting $T$-layer HSIVI-SM produces high-quality samples while inducing minimal computational overhead. For DDPMs, the $\epsilon$-prediction variant is naturally accommodated (Yu et al., 2023).
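For the $\epsilon$-prediction case, a noise prediction can be turned into a score with the standard DDPM relation $s(x_t) = -\epsilon / \sqrt{1-\bar\alpha_t}$, where $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The sketch below assumes this parameterization; `alpha_bar` is the usual cumulative noise-schedule product, and the point-mass helper exists only to check the conversion against a case with a known score.

```python
import math

def score_from_eps(eps_pred, alpha_bar):
    """Convert an epsilon prediction at noise level alpha_bar into a score estimate."""
    return -eps_pred / math.sqrt(1.0 - alpha_bar)

def exact_eps_point_mass(x_t, x0, alpha_bar):
    """Noise that produced x_t when the data distribution is a point mass at x0."""
    return (x_t - math.sqrt(alpha_bar) * x0) / math.sqrt(1.0 - alpha_bar)

def exact_score_point_mass(x_t, x0, alpha_bar):
    """Score of N(sqrt(alpha_bar) * x0, 1 - alpha_bar) evaluated at x_t."""
    return -(x_t - math.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)
```

In a bridging sequence built from diffusion marginals, the output of `score_from_eps` applied to a pre-trained network would play the role of the known bridge score $s_{p_t}$.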

6. Empirical Performance and Theoretical Guarantees

SIVI-SM and its hierarchical extensions exhibit the following reported properties:

  • On synthetic multimodal targets, multi-layer HSIVI-SM captures all modes, while single-layer SIVI-SM may miss modes or underestimate variance.
  • In high-dimensional posterior matching tasks (e.g., conditioned-diffusion, Bayesian logistic regression), SIVI-SM achieves lower errors and variance RMSE than surrogate-ELBO and unbiased-MCMC-based SIVI.
  • HSIVI-SM accelerates diffusion model sampling: with only 5–10 function evaluations, it achieves FID scores on datasets (e.g., CIFAR-10, CelebA) that are competitive or superior to DDIM, Analytic-DDPM, and DPM-Solver-fast, approaching long-chain DDPM results (Yu et al., 2023).

Consistency is provable: if the critic approximates the true score-residual function accurately, SIVI-SM approaches the variational optimum without resorting to inner-loop MCMC (Yu et al., 2023).

7. Kernelization and Algorithmic Developments

Kernel SIVI (KSIVI) advances SIVI-SM by analytic solution of the critic subproblem in a reproducing kernel Hilbert space (RKHS), replacing neural network critics with a closed-form solution. The overall objective becomes minimization of the kernel Stein discrepancy (KSD) between variational and target distributions:

$$\min_\phi \; \mathrm{KSD}^2(q_\phi \| p) = \bigl\| S_{q_\phi, k}(s_p - s_{q_\phi}) \bigr\|^2_{\mathcal{H}},$$

where the closed-form optimal critic is

$$f^*(\cdot) = S_{q_\phi, k}(s_p - s_{q_\phi}) = \mathbb{E}_{y \sim q_\phi} \bigl[ k(\cdot, y) \bigl( s_p(y) - s_{q_\phi}(y) \bigr) \bigr].$$

KSIVI thus requires no lower-level maximization, leading to improved stability, reduced variance in gradient estimates, and straightforward convergence guarantees for nonconvex stochastic optimization. KSIVI retains the sample- and computation-efficient properties of SIVI-SM, removing the necessity for inner-loop MCMC or neural critic optimization (Cheng et al., 2024).
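By the reproducing property, the squared RKHS norm above expands to $\mathbb{E}_{y,y'}\bigl[(s_p - s_{q_\phi})(y)^\top k(y,y')\,(s_p - s_{q_\phi})(y')\bigr]$ over independent pairs, which suggests a sample-based estimate. The sketch below is one plausible estimator under stated assumptions, not necessarily the one used by Cheng et al.: it is a U-statistic with an RBF kernel on the same one-dimensional Gaussian toy used for illustration elsewhere (target $\mathcal{N}(m, 2)$, $z \mid \varepsilon \sim \mathcal{N}(\mu + \varepsilon, 1)$), and it substitutes conditional scores for the intractable marginal score, which is unbiased only when the diagonal terms are excluded.

```python
import math
import random

def rbf(x, y, bandwidth=1.0):
    return math.exp(-0.5 * ((x - y) / bandwidth) ** 2)

def ksd2_u_statistic(mu, target_mean, n=400, seed=0):
    """U-statistic estimate of KSD^2 using conditional scores in place of the marginal score."""
    rng = random.Random(seed)
    a, sigma = 1.0, 1.0
    target_var = a * a + sigma * sigma
    zs, residuals = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + a * eps + sigma * rng.gauss(0.0, 1.0)
        target_score = -(z - target_mean) / target_var
        cond_score = -(z - mu - a * eps) / sigma ** 2
        zs.append(z)
        residuals.append(target_score - cond_score)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:   # skip diagonal terms, which would bias the estimate
                total += residuals[i] * residuals[j] * rbf(zs[i], zs[j])
    return total / (n * (n - 1))
```

The estimate should be near zero when `mu` equals `target_mean` and clearly positive otherwise, and it involves no critic parameters, so minimizing it over the variational parameters is a single-level optimization.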


References:

  • "Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration" (Yu et al., 2023)
  • "Semi-Implicit Variational Inference via Score Matching" (Yu et al., 2023)
  • "Kernel Semi-Implicit Variational Inference" (Cheng et al., 2024)
