Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations (2504.11554v1)

Published 15 Apr 2025 in stat.ML and cs.LG

Abstract: Bayesian inference with computationally expensive likelihood evaluations remains a significant challenge in many scientific domains. We propose normalizing flow regression (NFR), a novel offline inference method for approximating posterior distributions. Unlike traditional surrogate approaches that require additional sampling or inference steps, NFR directly yields a tractable posterior approximation through regression on existing log-density evaluations. We introduce training techniques specifically for flow regression, such as tailored priors and likelihood functions, to achieve robust posterior and model evidence estimation. We demonstrate NFR's effectiveness on synthetic benchmarks and real-world applications from neuroscience and biology, showing superior or comparable performance to existing methods. NFR represents a promising approach for Bayesian inference when standard methods are computationally prohibitive or existing model evaluations can be recycled.

Summary

  • The paper presents NFR, a method that directly approximates the posterior using normalizing flows trained on pre-computed likelihood evaluations.
  • It employs a Tobit likelihood with noise shaping and an annealed optimization scheme to robustly handle expensive, noisy density measurements.
  • Experimental results show that NFR achieves competitive performance against BBVI and VSBQ on both synthetic and real-world high-dimensional problems.

This paper introduces Normalizing Flow Regression (NFR), a novel method for approximate Bayesian inference specifically designed for scenarios where evaluating the likelihood function is computationally expensive and a set of offline (pre-computed) log-density evaluations is available.

The core problem addressed is that standard Bayesian inference methods like MCMC and VI often require numerous evaluations of the target probability density (likelihood times prior), which is prohibitive when each evaluation is costly (e.g., involves complex simulations). While surrogate modeling techniques (such as Gaussian processes) exist, they typically approximate the log-density function itself and require an additional inference step (such as running MCMC or VI on the surrogate) to obtain a tractable posterior approximation. Furthermore, many surrogate methods rely on active learning, which requires generating new, expensive evaluations.

NFR overcomes these limitations by using a normalizing flow, $q_{\phi}(x)$, directly as a regression model to approximate the target posterior distribution $p(x|\mathcal{D})$. It works with an existing dataset $\bm{\Xi} = (\mathbf{X}, \mathbf{y}, \bm{\sigma}^2)$ consisting of parameter locations $x_n$, corresponding potentially noisy log-density evaluations $y_n \approx \log p(x_n|\mathcal{D})$, and associated observation variances $\sigma_n^2$.
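
A minimal sketch of this setup is shown below (not the authors' code): the placeholder data, the flow hyperparameters, and the use of nflows' MaskedAutoregressiveFlow constructor are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from nflows.flows import MaskedAutoregressiveFlow

D = 5                              # parameter dimension (example value)
X = torch.randn(1000, D)           # parameter locations x_n (e.g., from MAP optimization runs)
y = -0.5 * (X ** 2).sum(dim=1)     # stand-in for noisy log-density evaluations y_n
sigma2 = torch.full_like(y, 1e-3)  # observation variances sigma_n^2

# Flow q_phi(x); hyperparameters here are illustrative, not the paper's settings.
flow = MaskedAutoregressiveFlow(features=D, hidden_features=32,
                                num_layers=5, num_blocks_per_layer=2)
log_C = torch.zeros(1, requires_grad=True)  # learnable log normalizing constant C

def f_theta(x):
    """Unnormalized log-density prediction f_Theta(x) = log q_phi(x) + C."""
    return flow.log_prob(x) + log_C
```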

Key Aspects of NFR:

  1. Direct Posterior Approximation: The normalizing flow $q_{\phi}(x)$ itself serves as the approximate posterior distribution. Once trained, it can be easily evaluated and sampled from.
  2. Offline Inference: NFR utilizes pre-existing log-density evaluations, such as those collected during preliminary Maximum A Posteriori (MAP) optimization runs, avoiding the need for additional costly model evaluations during the inference phase.
  3. Regression Formulation: The model predicts the unnormalized log-density as $f_{\Theta}(x) = \log q_{\phi}(x) + C$, where $C$ is a learnable parameter representing the logarithm of the unknown normalizing constant (model evidence). The parameters $\Theta = (\phi, C)$ are fit by maximizing the unnormalized log-posterior over $\Theta$ given the observations (MAP estimation of the flow parameters):

    $$\mathcal{L}(\Theta) = \sum_{n = 1}^N \log p\left(y_n \mid f_{\Theta}(x_n), \sigma_n^2 \right) + \log p(\phi) + \log p(C)$$

  4. Tobit Likelihood: To prevent the regression from being dominated by very small density values (large negative log-densities), NFR employs a Tobit likelihood function combined with noise shaping. This likelihood effectively censors observations below a threshold $y_\text{low}$ and adds artificial noise (noise shaping) that grows for lower-density points. This focuses the regression on accurately modeling the high-probability regions of the posterior while still using information from lower-density areas without overfitting to them (a code sketch illustrating this appears after this list). The likelihood is defined as:

    $$p(y_n \mid f_{\Theta}(x_n), \sigma_n^2) = \begin{cases} \mathcal{N}\left(y_n;\, f_{\Theta}(x_n),\, \sigma_n^2 + s(f_\text{max} - f_n)^2 \right) & \text{if } y_n > y_{\text{low}} \\ \Phi\left(\dfrac{y_{\text{low}} - f_{\Theta}(x_n)}{\sqrt{\sigma_n^2 + s(f_\text{max} - f_n)^2}}\right) & \text{if } y_n \leq y_{\text{low}} \end{cases}$$

    where $\Phi$ is the standard normal CDF and $s(\cdot)$ is the noise shaping function.

  5. Informative Priors: To regularize the flow and address the non-identifiability between the flow parameters $\phi$ and the log normalizing constant $C$, specific priors are introduced:
    • The flow's base distribution $p_0$ is set to a multivariate Gaussian whose mean and diagonal covariance are estimated from the high-density points in the training data $\bm{\Xi}$.
    • The scaling and shifting transformations within the flow (specifically the MAF architecture) are constrained using bounded tanh functions ($g_\text{scale}(\alpha^{(i)}) = \alpha_{\text{max}}^{\tanh(\alpha^{(i)})}$, $g_\text{shift}(\mu^{(i)}) = \mu_{\text{max}} \cdot \tanh(\mu^{(i)})$) to prevent excessive deviation from the base distribution.
    • A Gaussian prior $\mathcal{N}(\psi; \mathbf{0}, \sigma_{\psi}^2 \mathbf{I})$ is placed on the underlying neural network parameters $\psi$ that determine the scaling ($\alpha$) and shifting ($\mu$) parameters; $\sigma_\psi$ is calibrated via prior predictive checks.
    • An improper flat prior is used for $C$.
  6. Annealed Optimization: Optimization proceeds gradually according to an annealing schedule. The target density is interpolated between the base distribution $p_0(x)$ and the target posterior $p_\text{target}(x)$ using an inverse temperature $\beta_t$ that increases from 0 to 1. This stabilizes the optimization, starting from a known normalized distribution ($p_0$) and gradually incorporating the target information. The optimization objective uses tempered observations $\widetilde{y}_{\beta_t} = (1 - \beta_t) \log p_0(\mathbf{X}) + \beta_t \mathbf{y}$.
  7. Optimization Procedure: The algorithm first optimizes $C$ alone using Brent's method, then jointly optimizes $\phi$ and $C$ using L-BFGS.
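
The sketch below (not the authors' implementation) illustrates how the Tobit likelihood with noise shaping and the annealed objective from items 3–7 could be combined in PyTorch; the quadratic noise-shaping form, the function names, and all hyperparameter values are assumptions.

```python
import torch

def tobit_log_likelihood(y, f_pred, sigma2, y_low, f_max, noise_scale=0.05):
    """Tobit likelihood with noise shaping: observations below y_low are censored,
    and lower-density points receive inflated variance so they do not dominate the fit.
    The quadratic term noise_scale * (f_max - y)^2 is one simple stand-in for s(.)."""
    total_var = sigma2 + noise_scale * (f_max - y).clamp(min=0.0) ** 2
    normal = torch.distributions.Normal(f_pred, total_var.sqrt())
    ll_observed = normal.log_prob(y)  # Gaussian branch (y_n > y_low)
    ll_censored = normal.cdf(torch.full_like(y, y_low)).clamp_min(1e-40).log()  # Phi branch
    return torch.where(y > y_low, ll_observed, ll_censored)

def nfr_objective(f_theta, log_prior_phi, X, y, sigma2, log_p0_X, beta, y_low, f_max):
    """Annealed MAP objective: tempered targets interpolate between the base
    distribution log p0(X) and the observed log-densities y as beta goes from 0 to 1;
    the flat prior on C contributes only a constant and is omitted."""
    y_tempered = (1.0 - beta) * log_p0_X + beta * y
    log_lik = tobit_log_likelihood(y_tempered, f_theta(X), sigma2, y_low, f_max).sum()
    return log_lik + log_prior_phi()
```

In the paper, this type of objective is maximized first with Brent's method for $C$ and then with joint L-BFGS updates of $(\phi, C)$ while $\beta_t$ follows an annealing schedule; the sketch above leaves the optimizer choice to the user.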

Implementation and Experiments:

  • The authors implemented NFR using PyTorch and the nflows library, specifically employing the Masked Autoregressive Flow (MAF) architecture.
  • They evaluated NFR on synthetic problems (Multivariate Rosenbrock-Gaussian, Lumpy distribution) and real-world applications from neuroscience and biology (Bayesian timing model, Lotka-Volterra predator-prey model, Bayesian causal inference model for multisensory perception). Dimensions ranged from $D=5$ to $D=12$.
  • NFR was compared against Laplace approximation, Black-Box Variational Inference (BBVI) using the same flow architecture but requiring online evaluations, and Variational Sparse Bayesian Quadrature (VSBQ), another offline surrogate method.
  • Training data for NFR and VSBQ consisted of $3000D$ log-density evaluations collected from MAP optimization runs using CMA-ES or BADS.
  • Performance was measured using the absolute difference in log marginal likelihood ($\Delta$LML), the mean marginal total variation distance (MMTV), and the Gaussianized symmetrized KL divergence (GsKL); hedged sketches of the latter two metrics appear after this list.
  • NFR demonstrated strong performance, often outperforming or matching the baselines, especially on the more challenging problems (e.g., higher dimensions, noisy or expensive likelihoods). It significantly outperformed VSBQ on the 12D multisensory problem. BBVI struggled with convergence or required significantly more function evaluations (a $10\times$ budget) to achieve comparable results, and was often infeasible for expensive likelihoods.
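
For reference, here is a hedged sketch of how the MMTV and GsKL metrics named above are commonly computed from two sets of posterior samples (moment-matched Gaussians for GsKL, histogram marginals for MMTV); the exact estimators and normalization used in the paper may differ (e.g., some definitions average rather than sum the two KL terms).

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    term_trace = np.trace(cov1_inv @ cov0)
    term_quad = diff @ cov1_inv @ diff
    term_logdet = np.linalg.slogdet(cov1)[1] - np.linalg.slogdet(cov0)[1]
    return 0.5 * (term_trace + term_quad - d + term_logdet)

def gskl(samples_q, samples_p):
    """Symmetrized KL between Gaussians moment-matched to the two sample sets."""
    mu_q, cov_q = samples_q.mean(0), np.cov(samples_q, rowvar=False)
    mu_p, cov_p = samples_p.mean(0), np.cov(samples_p, rowvar=False)
    return gaussian_kl(mu_q, cov_q, mu_p, cov_p) + gaussian_kl(mu_p, cov_p, mu_q, cov_q)

def mmtv(samples_q, samples_p, bins=50):
    """Mean marginal total variation: average over dimensions of the TV distance
    between histogram estimates of the 1-D marginals (binning choice is illustrative)."""
    tvs = []
    for i in range(samples_q.shape[1]):
        lo = min(samples_q[:, i].min(), samples_p[:, i].min())
        hi = max(samples_q[:, i].max(), samples_p[:, i].max())
        hq, _ = np.histogram(samples_q[:, i], bins=bins, range=(lo, hi))
        hp, _ = np.histogram(samples_p[:, i], bins=bins, range=(lo, hi))
        tvs.append(0.5 * np.abs(hq / hq.sum() - hp / hp.sum()).sum())
    return float(np.mean(tvs))
```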

Practical Implications:

  • NFR provides a practical tool for Bayesian inference when likelihood evaluations are expensive, allowing researchers to reuse evaluations gathered from preliminary optimization runs (e.g., finding MAP estimates).
  • It directly outputs a tractable posterior approximation (the trained flow) and an estimate of the (log) model evidence ($C$).
  • The proposed training techniques (Tobit likelihood, priors, annealing) are crucial for robust performance.
  • The method is demonstrated to work well on problems up to $D=12$, a dimensionality common in scientific modeling.
  • The authors provide code and suggest diagnostic checks (PSIS $\hat{k}$, corner plots with training data overlaid) to assess the quality of the approximation.
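
As an illustration of a PSIS $\hat{k}$ check, here is a minimal sketch assuming a modest budget of extra target log-density evaluations is available and that the trained flow exposes nflows' sample_and_log_prob method; log_target is a hypothetical callable for the unnormalized log posterior, the ArviZ psislw utility is used for the Pareto fit, and the paper's exact diagnostic procedure may differ.

```python
import arviz as az
import torch

def psis_khat(flow, log_target, num_samples=1000):
    """Draw samples from the flow, compute importance log-ratios against the target,
    and return the Pareto shape estimate k-hat (values well below 0.7 are commonly
    taken to indicate a reasonable approximation)."""
    with torch.no_grad():
        xs, log_q = flow.sample_and_log_prob(num_samples)
    log_p = log_target(xs)  # unnormalized target log-density at flow samples (expensive)
    log_ratios = (log_p - log_q).cpu().numpy()
    _, khat = az.psislw(log_ratios)
    return float(khat)
```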

In summary, NFR offers a promising offline approach to Bayesian inference for computationally demanding models, bridging the gap between point estimation and full posterior approximation by effectively recycling existing model evaluations within a normalizing flow regression framework.
