Score-Matching in Semi-Implicit VI

Updated 24 January 2026

The paper introduces a score-matching approach (SIVI-SM) that directly optimizes the score function for semi-implicit variational distributions.
It replaces the conventional KL divergence with Fisher divergence, enabling tractable, MCMC-free training despite intractable marginal densities.
Empirical and theoretical extensions, including hierarchical and kernelized methods, demonstrate improved mode capture, accelerated diffusion sampling, and convergence guarantees.

Score-matching approaches to semi-implicit variational inference (SIVI-SM) constitute a principled methodology for training semi-implicit variational families by leveraging the Fisher (score) divergence instead of the conventional Kullback-Leibler (KL) divergence or surrogate evidence lower bounds (ELBOs). SIVI-SM directly optimizes over the score function, exploiting the hierarchical or semi-implicit construction of the variational distributions. The approach eliminates the need for tractable densities of marginals and enables scalable, MCMC-free training for expressive approximations of complicated posteriors. Hierarchical and kernelized extensions further expand SIVI-SM’s flexibility and computational efficiency.

1. Semi-Implicit Variational Inference and Intractable Densities

Traditional variational inference restricts the variational family $q_\varphi(z)$ to tractable, explicit distributions, optimizing

$\mathrm{ELBO}(\varphi) = \mathbb{E}_{z\sim q_\varphi}\bigl[\log p(x,z) - \log q_\varphi(z)\bigr].$
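As a point of reference, the ELBO can be estimated by Monte Carlo whenever $q_\varphi$ is explicit. The sketch below assumes a hypothetical conjugate toy model, $z\sim\mathcal{N}(0,1)$ and $x\mid z\sim\mathcal{N}(z,1)$, which is not taken from the source; the function names are illustrative only.

```python
import math
import random

def log_normal(v, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)

def elbo_mc(x, q_mean, q_var, n=2000, seed=0):
    """Monte Carlo ELBO for the assumed toy model z ~ N(0,1), x|z ~ N(z,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Reparameterized draw from the explicit Gaussian q.
        z = q_mean + math.sqrt(q_var) * rng.gauss(0.0, 1.0)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, q_mean, q_var)
    return total / n
```

In this toy model the exact posterior is $\mathcal{N}(x/2, 1/2)$, so choosing `q_mean = x / 2` and `q_var = 0.5` makes every sample equal to the log-evidence $\log\mathcal{N}(x; 0, 2)$, while any other choice gives a lower value. The estimator needs $\log q_\varphi(z)$ in closed form, which is exactly what a semi-implicit family lacks.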

SIVI generalizes this by introducing a “mixing” variable $\varepsilon$, yielding a semi-implicit family,

$q_\varphi(z) = \int q_\phi(z\mid\varepsilon) \, q_\xi(\varepsilon) \, d\varepsilon, \quad \varphi=(\phi,\xi),$

where the conditional $q_\phi(z\mid\varepsilon)$ is explicit and reparameterizable, while the mixing distribution $q_\xi(\varepsilon)$ may be implicit, i.e. defined only through a sampler. Unless conjugacies exist, the marginal $q_\varphi(z)$ and its log-density are intractable, precluding standard ELBO gradients. Prior strategies circumvent this with multi-sample surrogate bounds, which introduce bias, or with inner-loop MCMC, which adds substantial computational overhead (Yu et al., 2023).
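Sampling from a semi-implicit family is straightforward even though its density is not. The following is a minimal sketch under assumed choices: a standard normal mixing distribution, a Gaussian conditional, and a hypothetical nonlinear mean map `2 * tanh(2 * eps)` standing in for a neural network.

```python
import math
import random

def sample_semi_implicit(n, sigma=0.5, seed=0):
    """Draw n samples from q(z) = integral of q(z | eps) q(eps) d eps."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)              # mixing variable eps ~ q_xi
        mu = 2.0 * math.tanh(2.0 * eps)        # hypothetical mean map (assumption)
        z = mu + sigma * rng.gauss(0.0, 1.0)   # reparameterized draw from q_phi(z | eps)
        samples.append(z)
    return samples
```

Each conditional here is Gaussian, yet the marginal is bimodal, with most mass near $\pm 2$. This illustrates how mixing over $\varepsilon$ adds expressiveness while leaving $q_\varphi(z)$ without a closed form.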

2. Fisher Divergence and Score-Matching Objective

SIVI-SM replaces the KL divergence between the variational and target densities with the Fisher (score) divergence:

$J(\varphi) = \mathbb{E}_{z\sim q_\varphi}\left\|\,\nabla_z\log q_\varphi(z) - \nabla_z\log p(x,z)\,\right\|^2_2.$

Because the evidence $p(x)$ does not depend on $z$, the posterior score $\nabla_z\log p(z\mid x)$ equals $\nabla_z\log p(x,z)$, so the target enters only through a computable gradient. Hyvärinen’s original score-matching formulation requires the trace of a Jacobian; SIVI-SM instead borrows the idea behind denoising score matching and exploits the structure of the hierarchy. No explicit data corruption is involved: the mixture structure expresses the marginal score as a weighted average of conditional scores,

$\nabla_z\log q_\varphi(z) = \frac{1}{q_\varphi(z)}\int q_\xi(\varepsilon)\, q_\phi(z\mid\varepsilon)\, \nabla_z \log q_\phi(z\mid\varepsilon)\, d\varepsilon,$

that is, the expectation of $\nabla_z \log q_\phi(z\mid\varepsilon)$ under the conditional distribution of $\varepsilon$ given $z$. All expectations are taken over samples from the generative hierarchy, which keeps training scalable and avoids separate MCMC estimation of gradients (Yu et al., 2023).
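The mixture identity for the marginal score can be checked numerically in a case where the marginal happens to be known. The sketch below assumes $\varepsilon\sim\mathcal{N}(0,1)$ and $z\mid\varepsilon\sim\mathcal{N}(\varepsilon,\sigma^2)$, so that $q_\varphi(z)=\mathcal{N}(0,1+\sigma^2)$; the self-normalized weighting is an illustrative estimator, not part of SIVI-SM itself.

```python
import math
import random

def marginal_score_mc(z, sigma=1.0, n=50000, seed=0):
    """Estimate grad_z log q(z) as a q(z | eps)-weighted mean of conditional scores."""
    rng = random.Random(seed)
    num, den = 0.0, 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)                            # eps ~ q_xi
        w = math.exp(-0.5 * ((z - eps) / sigma) ** 2)        # proportional to q(z | eps)
        cond_score = -(z - eps) / sigma ** 2                 # grad_z log q(z | eps)
        num += w * cond_score
        den += w
    return num / den

def marginal_score_exact(z, sigma=1.0):
    """Score of the closed-form marginal N(0, 1 + sigma^2) in this toy case."""
    return -z / (1.0 + sigma ** 2)
```

The two functions should agree up to Monte Carlo error. In realistic settings the weights concentrate on very few $\varepsilon$ samples, which is one reason to avoid estimating the marginal score this way.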

3. Minimax Formulation and Critic Network

The marginal score $\nabla_z\log q_\varphi(z)$ cannot be evaluated directly, so SIVI-SM recasts training as a minimax problem:

$\min_{\varphi=(\phi,\xi)} \; \max_{f} \; \mathbb{E}_{\varepsilon\sim q_\xi,\; z\sim q_\phi(\cdot\mid\varepsilon)} \left[ 2 f(z)^\top \bigl( \nabla_z \log p(x,z) - \nabla_z \log q_\phi(z\mid\varepsilon) \bigr) - \|f(z)\|^2_2 \right].$

Here $f$ (the critic) can be a neural network. For fixed $\varphi$, the inner maximum is attained at $f^*(z) = \nabla_z \log p(x,z) - \nabla_z \log q_\varphi(z)$, and substituting $f^*$ recovers the Fisher divergence $J(\varphi)$. Only the conditional score appears inside the expectation, so every term can be estimated via hierarchical reparameterization. The SIVI-SM training algorithm alternates between updating the variational parameters and the critic’s parameters by stochastic optimization, using samples generated via the semi-implicit construction (Yu et al., 2023).
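The step from the minimax objective to the Fisher divergence can be sketched by completing the square. The shorthand $S(z)=\nabla_z\log p(x,z)$ and $s_q(z)=\nabla_z\log q_\varphi(z)$ is introduced here only for brevity, and the argument assumes the critic class is rich enough to contain the pointwise maximizer.

```latex
% Averaging the conditional score over eps given z yields the marginal score,
% so the objective depends on f only through S - s_q:
\begin{align*}
\mathbb{E}_{\varepsilon, z}\bigl[\, 2 f(z)^\top \bigl(S(z) - \nabla_z \log q_\phi(z\mid\varepsilon)\bigr) - \|f(z)\|_2^2 \,\bigr]
  &= \mathbb{E}_{z\sim q_\varphi}\bigl[\, 2 f(z)^\top \bigl(S(z) - s_q(z)\bigr) - \|f(z)\|_2^2 \,\bigr] \\
  &= \mathbb{E}_{z\sim q_\varphi}\bigl[\, \|S(z) - s_q(z)\|_2^2 - \|f(z) - (S(z) - s_q(z))\|_2^2 \,\bigr].
\end{align*}
% The second term is non-negative and vanishes only at f = S - s_q,
% so the maximum over f equals E || S - s_q ||^2 = J(varphi).
```

With a finite-capacity critic the inner maximum is a lower bound on $J(\varphi)$, which is why the quality of the critic matters for the outer minimization.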

Pseudocode Outline (SIVI-SM Minimax)

Sample mixing variables $\varepsilon\sim q_\xi$ and conditional noise
Generate $z$ samples using hierarchical reparameterization
Compute target scores $\nabla_z\log p(x,z)$ and conditional scores $\nabla_z\log q_\phi(z\mid\varepsilon)$
Update $\varphi$ (via minimization) and $f$ (via maximization) using stochastic gradients
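The outline above can be made concrete on a one-dimensional toy problem. The sketch below is not the authors' implementation: it assumes a Gaussian target $\mathcal{N}(1, 2)$, a conditional $z\mid\varepsilon\sim\mathcal{N}(a+b\varepsilon, \sigma^2)$ with fixed $\sigma$, a linear critic $f(z)=w_0+w_1 z$, and hand-derived gradients in place of automatic differentiation. All names and hyperparameters are hypothetical.

```python
import random

def train_sivi_sm_toy(steps=3000, batch=64, critic_steps=3,
                      lr_phi=0.01, lr_f=0.1, seed=0):
    """Alternating minimax updates for a 1-D Gaussian toy problem (assumed setup)."""
    rng = random.Random(seed)
    m, s2 = 1.0, 2.0      # target N(m, s2), score S(z) = -(z - m) / s2
    sigma = 1.0           # fixed std of the explicit conditional
    a, b = -1.0, 0.3      # variational parameters: conditional mean a + b * eps
    w0, w1 = 0.0, 0.0     # linear critic f(z) = w0 + w1 * z
    for _ in range(steps):
        # Steps 1-2: sample mixing variable and noise, then reparameterize z.
        eps = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        eta = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        z = [a + b * e + sigma * n for e, n in zip(eps, eta)]
        # Step 3: target score minus conditional score (the latter is -eta / sigma).
        g = [-(zi - m) / s2 + n / sigma for zi, n in zip(z, eta)]
        # Step 4a: gradient ascent on the critic parameters.
        for _ in range(critic_steps):
            r = [gi - (w0 + w1 * zi) for gi, zi in zip(g, z)]
            w0 += lr_f * 2.0 * sum(r) / batch
            w1 += lr_f * 2.0 * sum(ri * zi for ri, zi in zip(r, z)) / batch
        # Step 4b: gradient descent on (a, b) through z = a + b * eps + sigma * eta.
        dz = [2.0 * w1 * (gi - (w0 + w1 * zi)) - 2.0 * (w0 + w1 * zi) / s2
              for gi, zi in zip(g, z)]
        a -= lr_phi * sum(dz) / batch
        b -= lr_phi * sum(d * e for d, e in zip(dz, eps)) / batch
    return a, b, sigma
```

In this setup the marginal is $\mathcal{N}(a, b^2+\sigma^2)$, so a successful run should drive $a$ toward 1 and $b^2+\sigma^2$ toward 2. A linear critic suffices only because the score residual is linear for Gaussians; a neural critic would replace it in general.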

4. Hierarchical and Layer-wise Score-Matching

In complex settings where a single semi-implicit layer is insufficiently expressive, Hierarchical SIVI (HSIVI) stacks multiple conditional sampling layers:

$q_0(z_0)$: explicit base
$q_{\phi_1}(z_1\mid z_0)$, ..., $q_{\phi_L}(z_L\mid z_{L-1})$

The marginal at each layer $\ell$ is recursively defined as

$q_\ell(z_\ell) = \int q_{\phi_\ell}(z_\ell\mid z_{\ell-1})\, q_{\ell-1}(z_{\ell-1})\, dz_{\ell-1}, \quad \ell = 1,\dots,L.$

To facilitate optimization, HSIVI utilizes a bridging sequence of auxiliary distributions $\{p_\ell\}_{\ell=1}^{L}$, with each bridge’s score $\nabla_{z_\ell}\log p_\ell(z_\ell)$ known. Each conditional is matched to its corresponding auxiliary by minimizing the Fisher divergence

$J_\ell(\phi_\ell) = \mathbb{E}_{z_\ell\sim q_\ell}\left\|\,\nabla_{z_\ell}\log q_\ell(z_\ell) - \nabla_{z_\ell}\log p_\ell(z_\ell)\,\right\|^2_2.$

A minimax reformulation is applied per-layer, optimizing both network parameters and auxiliary critics, or globally with parameter sharing for efficient scaling (Yu et al., 2023).
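Stacked conditional layers can be sampled by pushing a base draw through each layer in turn. The sketch below assumes linear Gaussian layers with hypothetical `(alpha, sigma)` parameters, chosen only because the layer-wise marginal variance then has a closed-form recursion to compare against.

```python
import random

def sample_hierarchy(n, layers, seed=0):
    """Sample the top-layer variable of a stack of Gaussian conditional layers."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                    # explicit base draw from N(0, 1)
        for alpha, sigma in layers:                # each layer: N(alpha * z_prev, sigma^2)
            z = alpha * z + sigma * rng.gauss(0.0, 1.0)
        out.append(z)
    return out

def marginal_variance(layers):
    """Closed-form variance recursion v <- alpha^2 * v + sigma^2, valid for linear layers."""
    v = 1.0
    for alpha, sigma in layers:
        v = alpha * alpha * v + sigma * sigma
    return v
```

With nonlinear layers the recursion no longer applies and each layer's marginal is intractable, which is where the per-layer score-matching objectives come in.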

5. Practical Algorithms and Diffusion-Score Acceleration

Two principal training procedures exist:

Sequential (Layerwise) Training: Each conditional layer $q_{\phi_\ell}$ and its critic $f_\ell$ are updated until local convergence before proceeding upward in the hierarchy.
Joint (Parameter-Sharing) Training: All conditional layers $q_{\phi_\ell}$ and critics share weights. One samples batch indices across layers, optimizes the joint SM objective weighted by layer-specific coefficients, and updates parameters via stochastic gradients.

When applied to diffusion models, the auxiliary bridging sequence is constructed from a schedule of intermediate distributions (geometric interpolates or SDE marginals), and pre-trained diffusion scores $s_\theta(z_\ell, \ell) \approx \nabla_{z_\ell}\log p_\ell(z_\ell)$ are directly injected for efficient score estimation. The resulting $L$-layer HSIVI-SM produces high-quality samples while inducing minimal computational overhead. For DDPMs, the $\epsilon$-prediction (noise-prediction) variant is naturally accommodated (Yu et al., 2023).
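For geometric interpolates, each bridge score follows directly from the endpoint scores: if $p_\lambda \propto p_{\text{base}}^{1-\lambda}\, p_{\text{target}}^{\lambda}$, then $\nabla\log p_\lambda = (1-\lambda)\nabla\log p_{\text{base}} + \lambda\nabla\log p_{\text{target}}$, since the normalizing constant does not depend on $z$. The sketch below assumes one-dimensional Gaussian endpoints and an evenly spaced schedule; both are illustrative choices rather than the schedule used in the paper.

```python
def gaussian_score(z, mean, var):
    """Score of a univariate Gaussian N(mean, var)."""
    return -(z - mean) / var

def bridge_score(z, lam, base=(0.0, 1.0), target=(2.0, 0.25)):
    """Score of the geometric interpolate between base and target, each (mean, var)."""
    return (1.0 - lam) * gaussian_score(z, *base) + lam * gaussian_score(z, *target)

def linear_schedule(num_layers):
    """Evenly spaced interpolation weights, one per layer, ending at the target."""
    return [layer / num_layers for layer in range(1, num_layers + 1)]
```

For example, `[bridge_score(1.0, lam) for lam in linear_schedule(4)]` gives the known scores that the four layers would be matched against at $z = 1$.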

6. Empirical Performance and Theoretical Guarantees

SIVI-SM and its hierarchical extensions are reported to have the following properties:

On synthetic multimodal targets, multi-layer HSIVI-SM captures all modes, while single-layer SIVI-SM may miss modes or underestimate variance.
In high-dimensional posterior matching tasks (e.g., conditioned diffusion, Bayesian logistic regression), SIVI-SM achieves lower errors and lower variance RMSE than both surrogate-ELBO SIVI and the unbiased MCMC-based variant.
HSIVI-SM accelerates diffusion model sampling: with only 5–10 function evaluations, it achieves FID scores on datasets such as CIFAR-10 and CelebA that are competitive with or superior to DDIM, Analytic-DDPM, and DPM-Solver-fast, approaching long-chain DDPM results (Yu et al., 2023).

A consistency argument is available: if the critic accurately approximates the true score-residual function $\nabla_z\log p(x,z) - \nabla_z\log q_\varphi(z)$, SIVI-SM approaches the variational optimum without resorting to inner-loop MCMC (Yu et al., 2023).

7. Kernelization and Algorithmic Developments

Kernel SIVI (KSIVI) advances SIVI-SM by solving the critic subproblem analytically in a reproducing kernel Hilbert space (RKHS), replacing the neural network critic with a closed-form solution. The overall objective becomes minimization of the kernel Stein discrepancy (KSD) between variational and target distributions: $\min_{\varphi}\; \mathrm{KSD}(q_\varphi\,\|\,p)^2 = \mathbb{E}_{(\varepsilon,z),(\varepsilon',z')}\bigl[\,k(z,z')\; g_\varphi(z,\varepsilon)^\top g_\varphi(z',\varepsilon')\,\bigr],$ where

$g_\varphi(z,\varepsilon) = \nabla_z\log p(x,z) - \nabla_z\log q_\phi(z\mid\varepsilon),$

$k$ is the reproducing kernel, and $(\varepsilon,z)$, $(\varepsilon',z')$ are independent draws from the semi-implicit hierarchy.

KSIVI thus requires no lower-level maximization, leading to improved stability, reduced variance in gradient estimates, and straightforward convergence guarantees for nonconvex stochastic optimization. KSIVI retains the sample- and computation-efficient properties of SIVI-SM, removing the necessity for inner-loop MCMC or neural critic optimization (Cheng et al., 2024).
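A kernelized objective of this kind can be estimated from paired samples without any critic. The sketch below assumes a one-dimensional Gaussian target $\mathcal{N}(m, s^2)$, a conditional $z\mid\varepsilon\sim\mathcal{N}(a+b\varepsilon,\sigma^2)$, and a Gaussian RBF kernel with a hypothetical fixed bandwidth; it averages the kernel-weighted product of score residuals over distinct sample pairs.

```python
import math
import random

def ksd_objective(a, b, sigma, m, s2, n=400, bandwidth=1.0, seed=0):
    """Pairwise (U-statistic) estimate of E[k(z, z') g g'] for the assumed toy setup."""
    rng = random.Random(seed)
    zs, gs = [], []
    for _ in range(n):
        eps, eta = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z = a + b * eps + sigma * eta
        zs.append(z)
        # Target score minus conditional score; the latter equals -eta / sigma.
        gs.append(-(z - m) / s2 + eta / sigma)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            k = math.exp(-0.5 * ((zs[i] - zs[j]) / bandwidth) ** 2)   # RBF kernel
            total += k * gs[i] * gs[j]
    return 2.0 * total / (n * (n - 1))
```

When the marginal $\mathcal{N}(a, b^2+\sigma^2)$ matches the target, the estimate should be close to zero; shifting the mean away from $m$ makes it clearly positive. Minimizing this quantity over $(a, b)$ needs only an outer optimizer, with no inner maximization.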

References:

"Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration" (Yu et al., 2023)
"Semi-Implicit Variational Inference via Score Matching" (Yu et al., 2023)
"Kernel Semi-Implicit Variational Inference" (Cheng et al., 2024)
