Score-Matching in Semi-Implicit VI
- The paper introduces a score-matching approach (SIVI-SM) that directly optimizes the score function for semi-implicit variational distributions.
- It replaces the conventional KL divergence with Fisher divergence, enabling tractable, MCMC-free training despite intractable marginal densities.
- Empirical and theoretical extensions, including hierarchical and kernelized methods, demonstrate improved mode capture, accelerated diffusion sampling, and convergence guarantees.
Score-matching approaches to semi-implicit variational inference (SIVI-SM) constitute a principled methodology for training semi-implicit variational families by leveraging the Fisher (score) divergence instead of the conventional Kullback-Leibler (KL) divergence or surrogate evidence lower bounds (ELBOs). SIVI-SM directly optimizes over the score function, exploiting the hierarchical or semi-implicit construction of the variational distributions. The approach eliminates the need for tractable densities of marginals and enables scalable, MCMC-free training for expressive approximations of complicated posteriors. Hierarchical and kernelized extensions further expand SIVI-SM’s flexibility and computational efficiency.
1. Semi-Implicit Variational Inference and Intractable Densities
Traditional variational inference restricts the variational family to tractable, explicit distributions $q_\phi(x)$, optimizing
$$\min_\phi \; \mathrm{KL}\big(q_\phi(x) \,\|\, p(x \mid \mathcal{D})\big).$$
SIVI generalizes this by introducing a “mixing” variable $z$, yielding a semi-implicit family,
$$q_\phi(x) = \int q_\phi(x \mid z)\, q(z)\, dz,$$
where the mixing distribution $q(z)$ may be implicit, while the conditional $q_\phi(x \mid z)$ remains explicit and reparameterizable. Unless conjugacies exist, the marginal $q_\phi(x)$ and its log-density are intractable, precluding standard ELBO gradients. Prior strategies circumvent this by using multi-sample surrogate bounds or expensive inner-loop MCMC, which introduce bias or large computational overhead, respectively (Yu et al., 2023).
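A minimal sketch of drawing from a semi-implicit family, assuming a one-dimensional Gaussian conditional whose mean is a nonlinear function of the mixing variable. The function names, the `tanh` mean map, and the fixed `sigma` are illustrative assumptions and are not taken from the cited papers.

```python
import math
import random

def sample_semi_implicit(n, scale=2.0, sigma=0.3, seed=0):
    """Draw n pairs (x, z) with z ~ N(0, 1) and x | z ~ N(scale * tanh(z), sigma^2).

    The conditional is explicit and reparameterized (x = mean(z) + sigma * eps),
    while the marginal of x, obtained by integrating out z, has no closed form here.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)        # mixing variable
        eps = rng.gauss(0.0, 1.0)      # conditional noise
        x = scale * math.tanh(z) + sigma * eps
        pairs.append((x, z))
    return pairs

def conditional_score(x, z, scale=2.0, sigma=0.3):
    """Score of the explicit conditional: d/dx log N(x; scale * tanh(z), sigma^2)."""
    return -(x - scale * math.tanh(z)) / sigma ** 2

pairs = sample_semi_implicit(1000)
```

Sampling and the conditional score are both cheap here, even though the marginal density itself is never evaluated; this is the property that score-based training of such families relies on.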
2. Fisher Divergence and Score-Matching Objective
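SIVI-SM replaces the KL divergence between variational and target densities with the Fisher (score) divergence:
$$D_F(q_\phi \,\|\, p) = \mathbb{E}_{x \sim q_\phi(x)} \big\| S(x) - \nabla_x \log q_\phi(x) \big\|_2^2, \qquad S(x) = \nabla_x \log p(x \mid \mathcal{D}).$$
Hyvärinen’s continuous score-matching formulation classically requires trace computations of the Jacobian of a vector field, but SIVI-SM leverages denoising score matching and the structure of the hierarchy to obtain a tractable expectation. The key adaptation is the absence of explicit data corruption; instead, SIVI-SM expresses the score of the marginal through the mixture structure, with $z$ the mixing variable and $q_\phi(x \mid z)$ the explicit conditional:
$$\nabla_x \log q_\phi(x) = \mathbb{E}_{z \sim q_\phi(z \mid x)} \big[ \nabla_x \log q_\phi(x \mid z) \big].$$
All expectations are taken over samples from the generative hierarchy, ensuring scalability and avoiding separate MCMC estimation of gradients (Yu et al., 2023).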
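The mixture-structure identity for the marginal score can be checked numerically in a case where everything is computable. The sketch below assumes a discrete mixing variable (a two-component Gaussian mixture), so that the posterior over the mixing variable is available in closed form; in general SIVI settings that posterior is intractable, which is why it is never formed explicitly. All names and constants are illustrative.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def marginal_score_via_mixture(x, weights, means, std):
    """Posterior-weighted average of conditional scores for q(x) = sum_j w_j N(x; m_j, std^2)."""
    joint = [w * normal_pdf(x, m, std) for w, m in zip(weights, means)]
    total = sum(joint)
    posterior = [j / total for j in joint]                # q(z = j | x)
    cond_scores = [-(x - m) / std ** 2 for m in means]    # d/dx log q(x | z = j)
    return sum(p * s for p, s in zip(posterior, cond_scores))

def marginal_score_finite_diff(x, weights, means, std, h=1e-5):
    """Central finite difference of log q(x), used only as a reference value."""
    def log_q(t):
        return math.log(sum(w * normal_pdf(t, m, std) for w, m in zip(weights, means)))
    return (log_q(x + h) - log_q(x - h)) / (2.0 * h)

weights, means, std = [0.3, 0.7], [-1.0, 2.0], 0.8
gap = max(
    abs(marginal_score_via_mixture(x, weights, means, std)
        - marginal_score_finite_diff(x, weights, means, std))
    for x in [-2.0, -0.5, 0.0, 1.0, 3.0]
)
```

Here `gap` is the largest disagreement between the two ways of computing the marginal score over a few test points; it should be at the level of finite-difference error.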
3. Minimax Formulation and Critic Network
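To bypass intractable marginal scores, SIVI-SM recasts the training in a minimax formulation:
$$\min_\phi \max_\psi \; \mathbb{E}_{z \sim q(z),\, x \sim q_\phi(x \mid z)} \Big[ 2 f_\psi(x)^\top \big( S(x) - \nabla_x \log q_\phi(x \mid z) \big) - \| f_\psi(x) \|_2^2 \Big].$$
Here, $f_\psi$ (the critic) can be a neural network, targeting optimality at $f^\ast(x) = S(x) - \nabla_x \log q_\phi(x)$, where $S(x)$ denotes the target score. When the critic is optimal, the objective reduces to the Fisher divergence. All relevant expectations are tractable via hierarchical reparameterization. The SIVI-SM training algorithm alternates between updating the variational parameters $\phi$ and the critic’s parameters $\psi$ by stochastic optimization, using samples generated via the semi-implicit construction (Yu et al., 2023).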
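The reduction to the Fisher divergence follows from completing the square. The derivation below is a short sketch: $S(x)$ is the target score, $f$ is the critic, $q_\phi(x \mid z)$ is the explicit conditional, and $r_\phi$ is shorthand introduced here for the score residual. The first equality uses the fact that the conditional score averages to the marginal score under the posterior of $z$ given $x$.

```latex
\begin{aligned}
&\mathbb{E}_{q(z)\,q_\phi(x\mid z)}\!\left[2 f(x)^\top\big(S(x)-\nabla_x \log q_\phi(x\mid z)\big)-\|f(x)\|_2^2\right] \\
&\quad= \mathbb{E}_{q_\phi(x)}\!\left[2 f(x)^\top r_\phi(x)-\|f(x)\|_2^2\right],
\qquad r_\phi(x) := S(x)-\nabla_x \log q_\phi(x), \\
&\quad= \mathbb{E}_{q_\phi(x)}\!\left[\|r_\phi(x)\|_2^2-\|f(x)-r_\phi(x)\|_2^2\right].
\end{aligned}
```

The last line is maximized over $f$ at $f = r_\phi$, where it equals $\mathbb{E}_{q_\phi(x)}\|r_\phi(x)\|_2^2$, the Fisher divergence between the variational marginal and the target.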
Pseudocode Outline (SIVI-SM Minimax)
- Sample mixing variables and conditional noise
- Generate samples using hierarchical reparameterization
- Compute target scores and conditional scores
- Update the variational parameters (via minimization) and the critic parameters (via maximization) using stochastic gradients
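The outline above can be sketched on a one-dimensional toy problem with hand-written gradients. Everything specific here is an assumption made for illustration: the Gaussian target, the linear-Gaussian variational family `x = a + b * z + SIGMA * eps`, the linear critic `f(x) = c + d * x`, the step sizes, and the number of critic steps per outer step. It is not the authors' implementation, which uses neural networks and automatic differentiation.

```python
import random

M_P, S_P = 2.0, 1.0    # assumed toy target N(M_P, S_P^2)
SIGMA = 0.5            # fixed std of the explicit conditional q(x | z)

def target_score(x):
    return -(x - M_P) / S_P ** 2

def gradients(phi, psi, batch):
    """Monte Carlo gradients of E[2 f(x) (S(x) - cond_score) - f(x)^2].

    phi = (a, b): x = a + b * z + SIGMA * eps, so x | z ~ N(a + b * z, SIGMA^2).
    psi = (c, d): linear critic f(x) = c + d * x.
    """
    a, b = phi
    c, d = psi
    g_a = g_b = g_c = g_d = 0.0
    for z, eps in batch:
        x = a + b * z + SIGMA * eps         # hierarchical reparameterization
        cond_score = -eps / SIGMA           # d/dx log q(x | z) at the sample
        resid = target_score(x) - cond_score
        f = c + d * x
        # Critic gradients: resid does not depend on psi.
        g_c += 2.0 * resid - 2.0 * f
        g_d += (2.0 * resid - 2.0 * f) * x
        # Pathwise derivative through x; cond_score = -eps / SIGMA is constant in phi.
        dl_dx = 2.0 * d * resid + 2.0 * f * (-1.0 / S_P ** 2) - 2.0 * f * d
        g_a += dl_dx
        g_b += dl_dx * z
    n = len(batch)
    return (g_a / n, g_b / n), (g_c / n, g_d / n)

def train(steps=200, batch_size=64, lr_phi=0.02, lr_psi=0.05, critic_steps=5, seed=0):
    rng = random.Random(seed)
    phi, psi = (0.0, 1.0), (0.0, 0.0)

    def draw():
        # Sample mixing variables and conditional noise.
        return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(batch_size)]

    for _ in range(steps):
        for _ in range(critic_steps):                  # gradient ascent on the critic
            _, (g_c, g_d) = gradients(phi, psi, draw())
            psi = (psi[0] + lr_psi * g_c, psi[1] + lr_psi * g_d)
        (g_a, g_b), _ = gradients(phi, psi, draw())    # gradient descent on phi
        phi = (phi[0] - lr_phi * g_a, phi[1] - lr_phi * g_b)
    return phi, psi

phi, psi = train()
```

In this toy family the marginal is N(a, b^2 + SIGMA^2), so the Fisher divergence vanishes when `a = M_P` and `b**2 + SIGMA**2 = S_P**2`. The residual is linear in `x`, which is why a linear critic suffices here; a realistic target would need a more flexible critic.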
4. Hierarchical and Layer-wise Score-Matching
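In complex settings where a single semi-implicit layer is insufficiently expressive, Hierarchical SIVI (HSIVI) stacks multiple conditional sampling layers:
- $q_0(x_0)$: explicit base
- $q_{\phi_1}(x_1 \mid x_0)$, ..., $q_{\phi_L}(x_L \mid x_{L-1})$: conditional layers
The marginal at each layer is recursively defined as
$$q_k(x_k) = \int q_{\phi_k}(x_k \mid x_{k-1})\, q_{k-1}(x_{k-1})\, dx_{k-1}, \qquad k = 1, \ldots, L.$$
To facilitate optimization, HSIVI utilizes a bridging sequence of auxiliary distributions $\{p_k\}_{k=1}^{L}$, with each bridge’s score $S_k(x) = \nabla_x \log p_k(x)$ known. Each conditional layer is trained so that its marginal matches the corresponding auxiliary by minimizing the Fisher divergence
$$\mathbb{E}_{x_k \sim q_k(x_k)} \big\| S_k(x_k) - \nabla_{x_k} \log q_k(x_k) \big\|_2^2.$$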
A minimax reformulation is applied per-layer, optimizing both network parameters and auxiliary critics, or globally with parameter sharing for efficient scaling (Yu et al., 2023).
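A minimal sketch of ancestral sampling through stacked conditional layers, assuming one-dimensional linear-Gaussian layers so that the layer marginals have closed-form moments for comparison. The layer parameterization `(w, b, sigma)`, the standard normal base, and the function names are illustrative assumptions; HSIVI layers in practice are neural networks.

```python
import random

def sample_hierarchy(layers, n, seed=0):
    """Ancestral sampling through stacked Gaussian conditional layers.

    layers: list of (w, b, sigma); each layer draws x_k | x_prev ~ N(w * x_prev + b, sigma^2).
    Returns n trajectories [x_0, x_1, ..., x_L], with x_0 ~ N(0, 1) as the explicit base.
    """
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        path = [x]
        for w, b, sigma in layers:
            x = w * x + b + sigma * rng.gauss(0.0, 1.0)   # reparameterized layer
            path.append(x)
        trajectories.append(path)
    return trajectories

def layer_conditional_score(x_k, x_prev, layer):
    """Score of one explicit conditional layer with respect to its output."""
    w, b, sigma = layer
    return -(x_k - (w * x_prev + b)) / sigma ** 2

def marginal_moments(layers):
    """Closed-form (mean, variance) of each layer marginal; linear-Gaussian case only."""
    mean, var = 0.0, 1.0
    moments = [(mean, var)]
    for w, b, sigma in layers:
        mean, var = w * mean + b, w * w * var + sigma ** 2
        moments.append((mean, var))
    return moments

layers = [(0.9, 0.5, 0.3), (0.8, 1.0, 0.2)]
paths = sample_hierarchy(layers, 2000)
moments = marginal_moments(layers)
```

Each trajectory supplies, for every layer, a sample from that layer's marginal together with the input that produced it, which is exactly what a per-layer conditional score needs.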
5. Practical Algorithms and Diffusion-Score Acceleration
Two principal training procedures exist:
- Sequential (Layerwise) Training: Each conditional layer and its critic are updated until local convergence before proceeding to the next layer in the hierarchy.
- Joint (Parameter-Sharing) Training: All conditional layers and critics share weights. One samples batch indices across layers, optimizes the joint SM objective weighted by layer-specific coefficients, and updates parameters via stochastic gradients.
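The joint procedure can be sketched as a Monte Carlo estimate of a weighted sum of per-layer losses, where only randomly chosen layers are evaluated at each step. Uniform sampling of layer indices, the rescaling by the number of layers, and the interface of zero-argument loss callables are assumptions made for this illustration; the papers' actual sampling scheme and weights may differ.

```python
import random

def joint_objective(layer_losses, weights, num_draws, seed=0):
    """Estimate sum_k weights[k] * E[layer_losses[k]()] from randomly drawn layer indices.

    layer_losses: list of zero-argument callables, one per layer, each returning a
    (possibly stochastic) score-matching loss estimate for that layer.
    Indices are drawn uniformly, so each term is rescaled by the number of layers
    to keep the estimate unbiased for the weighted sum.
    """
    rng = random.Random(seed)
    num_layers = len(layer_losses)
    total = 0.0
    for _ in range(num_draws):
        k = rng.randrange(num_layers)
        total += num_layers * weights[k] * layer_losses[k]()
    return total / num_draws

def constant_loss(value):
    """Hypothetical stand-in for a per-layer loss, returning a fixed value."""
    def loss():
        return value
    return loss

estimate = joint_objective([constant_loss(2.0), constant_loss(2.0)], [0.5, 0.5], num_draws=10)
```

With shared weights, a gradient step on this estimate updates all layers at once while touching only the sampled layers' losses.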
When applied to diffusion models, the auxiliary bridging sequence is constructed from a schedule of intermediate distributions (geometric interpolations or SDE marginals), and pre-trained diffusion scores are directly injected for efficient score estimation. The resulting multi-layer HSIVI-SM produces high-quality samples while inducing minimal computational overhead. For DDPMs, the $\epsilon$-prediction variant is naturally accommodated (Yu et al., 2023).
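For the geometric-interpolation case, the bridge scores are available without any normalizing constants, because the gradient of the log-density of a geometric mixture is a convex combination of the endpoint scores. The sketch below assumes a linear interpolation schedule and two one-dimensional Gaussian endpoints; both choices, and all function names, are illustrative and not taken from the cited papers.

```python
from functools import partial

def geometric_bridge_score(x, lam, score_base, score_target):
    """Score of the bridge proportional to p_base(x)^(1 - lam) * p_target(x)^lam.

    Taking the gradient of the log removes the normalizing constant, so the
    bridge score is a convex combination of the two endpoint scores.
    """
    return (1.0 - lam) * score_base(x) + lam * score_target(x)

def make_bridge_scores(num_layers, score_base, score_target):
    """One score function per layer, with lam increasing linearly up to 1."""
    return [
        partial(geometric_bridge_score, lam=k / num_layers,
                score_base=score_base, score_target=score_target)
        for k in range(1, num_layers + 1)
    ]

def base_score(x):
    return -x              # score of N(0, 1)

def final_score(x):
    return -(x - 2.0)      # score of N(2, 1)

bridge = make_bridge_scores(4, base_score, final_score)
```

The last entry of `bridge` coincides with the target score, and earlier entries interpolate toward the base, giving each layer a known score to match.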
6. Empirical Performance and Theoretical Guarantees
SIVI-SM and its hierarchical extensions are reported to exhibit the following properties:
- On synthetic multimodal targets, multi-layer HSIVI-SM captures all modes, while single-layer SIVI-SM may miss modes or underestimate variance.
- In high-dimensional posterior matching tasks (e.g., conditioned diffusion, Bayesian logistic regression), SIVI-SM achieves lower approximation errors and variance RMSE than surrogate-ELBO-based and unbiased MCMC-based SIVI variants.
- HSIVI-SM accelerates diffusion model sampling: with only 5–10 function evaluations, it achieves FID scores on datasets (e.g., CIFAR-10, CelebA) that are competitive or superior to DDIM, Analytic-DDPM, and DPM-Solver-fast, approaching long-chain DDPM results (Yu et al., 2023).
Consistency is provable: if the critic approximates the true score-residual function accurately, SIVI-SM approaches the variational optimum without resort to inner-loop MCMC (Yu et al., 2023).
7. Kernelization and Algorithmic Developments
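Kernel SIVI (KSIVI) advances SIVI-SM by solving the critic subproblem analytically in a reproducing kernel Hilbert space (RKHS), replacing neural network critics with a closed-form solution. The overall objective becomes minimization of the kernel Stein discrepancy (KSD) between variational and target distributions:
$$\min_\phi \; \mathrm{KSD}^2(q_\phi \,\|\, p) = \mathbb{E}_{(x, z),\, (x', z')} \Big[ k(x, x')\, \big( S(x) - \nabla_x \log q_\phi(x \mid z) \big)^\top \big( S(x') - \nabla_{x'} \log q_\phi(x' \mid z') \big) \Big],$$
where $(x, z)$ and $(x', z')$ are independent draws from $q_\phi(x \mid z)\, q(z)$, $S$ is the target score, and $k$ is the reproducing kernel of the RKHS.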
KSIVI thus requires no lower-level maximization, leading to improved stability, reduced variance in gradient estimates, and straightforward convergence guarantees for nonconvex stochastic optimization. KSIVI retains the sample- and computation-efficient properties of SIVI-SM, removing the necessity for inner-loop MCMC or neural critic optimization (Cheng et al., 2024).
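A minimal sketch of a sample-based KSD estimate in one dimension, assuming an RBF kernel and a U-statistic over distinct sample pairs. Each residual is the target score minus the conditional score at a hierarchy sample. The kernel choice, bandwidth, and estimator form are assumptions for illustration, not necessarily the estimator used by Cheng et al. (2024).

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def ksd_sq_ustat(xs, residuals, bandwidth=1.0):
    """U-statistic estimate of E[k(x, x') * r * r'] over independent sample pairs.

    xs: samples x_i drawn through the hierarchy (z_i from the mixing distribution,
        then x_i from the explicit conditional given z_i).
    residuals: r_i = S(x_i) - d/dx log q(x_i | z_i), one per sample.
    Excluding i == j keeps the two factors independent; as a consequence the
    estimate can be negative for finite samples.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += rbf_kernel(xs[i], xs[j], bandwidth) * residuals[i] * residuals[j]
    return total / (n * (n - 1))

value = ksd_sq_ustat([0.0, 0.0], [1.0, 1.0])
```

No critic appears anywhere in this estimate: the kernel takes over the critic's role, so the variational parameters can be trained by ordinary stochastic gradient descent on the estimate alone.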
References:
- "Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration" (Yu et al., 2023)
- "Semi-Implicit Variational Inference via Score Matching" (Yu et al., 2023)
- "Kernel Semi-Implicit Variational Inference" (Cheng et al., 2024)