Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations (2504.11554v1)

Published 15 Apr 2025 in stat.ML and cs.LG

Abstract: Bayesian inference with computationally expensive likelihood evaluations remains a significant challenge in many scientific domains. We propose normalizing flow regression (NFR), a novel offline inference method for approximating posterior distributions. Unlike traditional surrogate approaches that require additional sampling or inference steps, NFR directly yields a tractable posterior approximation through regression on existing log-density evaluations. We introduce training techniques specifically for flow regression, such as tailored priors and likelihood functions, to achieve robust posterior and model evidence estimation. We demonstrate NFR's effectiveness on synthetic benchmarks and real-world applications from neuroscience and biology, showing superior or comparable performance to existing methods. NFR represents a promising approach for Bayesian inference when standard methods are computationally prohibitive or existing model evaluations can be recycled.

Summary

  • The paper presents NFR, a method that directly approximates the posterior using normalizing flows trained on pre-computed likelihood evaluations.
  • It employs a Tobit likelihood with noise shaping and an annealed optimization scheme to robustly handle expensive, noisy density measurements.
  • Experimental results show that NFR achieves competitive performance against BBVI and VSBQ on both synthetic and real-world high-dimensional problems.

This paper introduces Normalizing Flow Regression (NFR), a novel method for approximate Bayesian inference specifically designed for scenarios where evaluating the likelihood function is computationally expensive and a set of offline (pre-computed) log-density evaluations is available.

The core problem addressed is that standard Bayesian inference methods like MCMC and VI often require numerous evaluations of the target probability density (likelihood times prior), which is prohibitive when each evaluation is costly (e.g., involves complex simulations). While surrogate modeling techniques (such as Gaussian processes) exist, they typically approximate the log-density function itself and require an additional inference step (such as running MCMC or VI on the surrogate) to obtain a tractable posterior approximation. Furthermore, many surrogate methods rely on active learning, which requires generating new, expensive evaluations.

NFR overcomes these limitations by using a normalizing flow, $q_{\phi}(x)$, directly as a regression model to approximate the target posterior distribution $p(x|\mathcal{D})$. It works with an existing dataset $\bm{\Xi} = (\mathbf{X}, \mathbf{y}, \bm{\sigma}^2)$ consisting of parameter locations $x_n$, corresponding potentially noisy log-density evaluations $y_n \approx \log p(x_n|\mathcal{D})$, and associated observation variances $\sigma_n^2$.
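
A minimal sketch of this setup is shown below (not the authors' code): the placeholder data, the flow hyperparameters, and the use of nflows' MaskedAutoregressiveFlow constructor are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from nflows.flows import MaskedAutoregressiveFlow

D = 5                              # parameter dimension (example value)
X = torch.randn(1000, D)           # parameter locations x_n (e.g., from MAP optimization runs)
y = -0.5 * (X ** 2).sum(dim=1)     # stand-in for noisy log-density evaluations y_n
sigma2 = torch.full_like(y, 1e-3)  # observation variances sigma_n^2

# Flow q_phi(x); hyperparameters here are illustrative, not the paper's settings.
flow = MaskedAutoregressiveFlow(features=D, hidden_features=32,
                                num_layers=5, num_blocks_per_layer=2)
log_C = torch.zeros(1, requires_grad=True)  # learnable log normalizing constant C

def f_theta(x):
    """Unnormalized log-density prediction f_Theta(x) = log q_phi(x) + C."""
    return flow.log_prob(x) + log_C
```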

Key Aspects of NFR:

  1. Direct Posterior Approximation: The normalizing flow $q_{\phi}(x)$ itself serves as the approximate posterior distribution. Once trained, it can be easily evaluated and sampled from.
  2. Offline Inference: NFR utilizes pre-existing log-density evaluations, such as those collected during preliminary Maximum A Posteriori (MAP) optimization runs, avoiding the need for additional costly model evaluations during the inference phase.
  3. Regression Formulation: The model predicts the unnormalized log-density as $f_{\Theta}(x) = \log q_{\phi}(x) + C$, where $C$ is a learnable parameter representing the logarithm of the unknown normalizing constant (model evidence). The parameters $\Theta = (\phi, C)$ are fit by maximizing the unnormalized log-posterior over $\Theta$ given the observations (MAP estimation of the flow parameters):

    $$\mathcal{L}(\Theta) = \sum_{n = 1}^N \log p\left(y_n \mid f_{\Theta}(x_n), \sigma_n^2 \right) + \log p(\phi) + \log p(C)$$

  4. Tobit Likelihood: To prevent the regression from being dominated by very small density values (large negative log-densities), NFR employs a Tobit likelihood function combined with noise shaping. This likelihood effectively censors observations below a threshold $y_\text{low}$ and adds artificial noise (noise shaping) that grows for lower-density points. This focuses the regression on accurately modeling the high-probability regions of the posterior while still using information from lower-density areas without overfitting to them (a code sketch illustrating this appears after this list). The likelihood is defined as:

    $$p(y_n \mid f_{\Theta}(x_n), \sigma_n^2) = \begin{cases} \mathcal{N}\left(y_n;\, f_{\Theta}(x_n),\, \sigma_n^2 + s(f_\text{max} - f_n)^2 \right) & \text{if } y_n > y_{\text{low}} \\ \Phi\left(\dfrac{y_{\text{low}} - f_{\Theta}(x_n)}{\sqrt{\sigma_n^2 + s(f_\text{max} - f_n)^2}}\right) & \text{if } y_n \leq y_{\text{low}} \end{cases}$$

    where $\Phi$ is the standard normal CDF and $s(\cdot)$ is the noise shaping function.

  5. Informative Priors: To regularize the flow and address the non-identifiability between the flow parameters $\phi$ and the log normalizing constant $C$, specific priors are introduced:
    • The flow's base distribution $p_0$ is set to a multivariate Gaussian whose mean and diagonal covariance are estimated from the high-density points in the training data $\bm{\Xi}$.
    • The scaling and shifting transformations within the flow (specifically the MAF architecture) are constrained using bounded tanh functions ($g_\text{scale}(\alpha^{(i)}) = \alpha_{\text{max}}^{\tanh(\alpha^{(i)})}$, $g_\text{shift}(\mu^{(i)}) = \mu_{\text{max}} \cdot \tanh(\mu^{(i)})$) to prevent excessive deviation from the base distribution.
    • A Gaussian prior $\mathcal{N}(\psi; \mathbf{0}, \sigma_{\psi}^2 \mathbf{I})$ is placed on the underlying neural network parameters $\psi$ that determine the scaling ($\alpha$) and shifting ($\mu$) parameters; $\sigma_\psi$ is calibrated via prior predictive checks.
    • An improper flat prior is used for $C$.
  6. Annealed Optimization: Optimization proceeds gradually according to an annealing schedule. The target density is interpolated between the base distribution $p_0(x)$ and the target posterior $p_\text{target}(x)$ using an inverse temperature $\beta_t$ that increases from 0 to 1. This stabilizes the optimization, starting from a known normalized distribution ($p_0$) and gradually incorporating the target information. The optimization objective uses tempered observations $\widetilde{y}_{\beta_t} = (1 - \beta_t) \log p_0(\mathbf{X}) + \beta_t \mathbf{y}$.
  7. Optimization Procedure: The algorithm first optimizes $C$ alone using Brent's method, then jointly optimizes $\phi$ and $C$ using L-BFGS.
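
The sketch below (not the authors' implementation) illustrates how the Tobit likelihood with noise shaping and the annealed objective from items 3–7 could be combined in PyTorch; the quadratic noise-shaping form, the function names, and all hyperparameter values are assumptions.

```python
import torch

def tobit_log_likelihood(y, f_pred, sigma2, y_low, f_max, noise_scale=0.05):
    """Tobit likelihood with noise shaping: observations below y_low are censored,
    and lower-density points receive inflated variance so they do not dominate the fit.
    The quadratic term noise_scale * (f_max - y)^2 is one simple stand-in for s(.)."""
    total_var = sigma2 + noise_scale * (f_max - y).clamp(min=0.0) ** 2
    normal = torch.distributions.Normal(f_pred, total_var.sqrt())
    ll_observed = normal.log_prob(y)  # Gaussian branch (y_n > y_low)
    ll_censored = normal.cdf(torch.full_like(y, y_low)).clamp_min(1e-40).log()  # Phi branch
    return torch.where(y > y_low, ll_observed, ll_censored)

def nfr_objective(f_theta, log_prior_phi, X, y, sigma2, log_p0_X, beta, y_low, f_max):
    """Annealed MAP objective: tempered targets interpolate between the base
    distribution log p0(X) and the observed log-densities y as beta goes from 0 to 1;
    the flat prior on C contributes only a constant and is omitted."""
    y_tempered = (1.0 - beta) * log_p0_X + beta * y
    log_lik = tobit_log_likelihood(y_tempered, f_theta(X), sigma2, y_low, f_max).sum()
    return log_lik + log_prior_phi()
```

In the paper, this type of objective is maximized first with Brent's method for $C$ and then with joint L-BFGS updates of $(\phi, C)$ while $\beta_t$ follows an annealing schedule; the sketch above leaves the optimizer choice to the user.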

Implementation and Experiments:

  • The authors implemented NFR using PyTorch and the nflows library, specifically employing the Masked Autoregressive Flow (MAF) architecture.
  • They evaluated NFR on synthetic problems (Multivariate Rosenbrock-Gaussian, Lumpy distribution) and real-world applications from neuroscience and biology (Bayesian timing model, Lotka-Volterra predator-prey model, Bayesian causal inference model for multisensory perception). Dimensions ranged from $D=5$ to $D=12$.
  • NFR was compared against Laplace approximation, Black-Box Variational Inference (BBVI) using the same flow architecture but requiring online evaluations, and Variational Sparse Bayesian Quadrature (VSBQ), another offline surrogate method.
  • Training data for NFR and VSBQ consisted of $3000D$ log-density evaluations collected from MAP optimization runs using CMA-ES or BADS.
  • Performance was measured using the absolute difference in log marginal likelihood ($\Delta$LML), the mean marginal total variation distance (MMTV), and the Gaussianized symmetrized KL divergence (GsKL); hedged sketches of the latter two metrics appear after this list.
  • NFR demonstrated strong performance, often outperforming or matching the baselines, especially on the more challenging problems (e.g., higher dimensions, noisy or expensive likelihoods). It significantly outperformed VSBQ on the 12D multisensory problem. BBVI struggled with convergence or required significantly more function evaluations (a $10\times$ budget) to achieve comparable results, and was often infeasible for expensive likelihoods.
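
For reference, here is a hedged sketch of how the MMTV and GsKL metrics named above are commonly computed from two sets of posterior samples (moment-matched Gaussians for GsKL, histogram marginals for MMTV); the exact estimators and normalization used in the paper may differ (e.g., some definitions average rather than sum the two KL terms).

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    term_trace = np.trace(cov1_inv @ cov0)
    term_quad = diff @ cov1_inv @ diff
    term_logdet = np.linalg.slogdet(cov1)[1] - np.linalg.slogdet(cov0)[1]
    return 0.5 * (term_trace + term_quad - d + term_logdet)

def gskl(samples_q, samples_p):
    """Symmetrized KL between Gaussians moment-matched to the two sample sets."""
    mu_q, cov_q = samples_q.mean(0), np.cov(samples_q, rowvar=False)
    mu_p, cov_p = samples_p.mean(0), np.cov(samples_p, rowvar=False)
    return gaussian_kl(mu_q, cov_q, mu_p, cov_p) + gaussian_kl(mu_p, cov_p, mu_q, cov_q)

def mmtv(samples_q, samples_p, bins=50):
    """Mean marginal total variation: average over dimensions of the TV distance
    between histogram estimates of the 1-D marginals (binning choice is illustrative)."""
    tvs = []
    for i in range(samples_q.shape[1]):
        lo = min(samples_q[:, i].min(), samples_p[:, i].min())
        hi = max(samples_q[:, i].max(), samples_p[:, i].max())
        hq, _ = np.histogram(samples_q[:, i], bins=bins, range=(lo, hi))
        hp, _ = np.histogram(samples_p[:, i], bins=bins, range=(lo, hi))
        tvs.append(0.5 * np.abs(hq / hq.sum() - hp / hp.sum()).sum())
    return float(np.mean(tvs))
```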

Practical Implications:

  • NFR provides a practical tool for Bayesian inference when likelihood evaluations are expensive, allowing researchers to reuse evaluations gathered from preliminary optimization runs (e.g., finding MAP estimates).
  • It directly outputs a tractable posterior approximation (the trained flow) and an estimate of the (log) model evidence ($C$).
  • The proposed training techniques (Tobit likelihood, priors, annealing) are crucial for robust performance.
  • The method is demonstrated to work well on problems up to $D=12$, a dimensionality common in scientific modeling.
  • The authors provide code and suggest diagnostic checks (PSIS $\hat{k}$, corner plots with training data overlaid) to assess the quality of the approximation.
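
As an illustration of a PSIS $\hat{k}$ check, here is a minimal sketch assuming a modest budget of extra target log-density evaluations is available and that the trained flow exposes nflows' sample_and_log_prob method; log_target is a hypothetical callable for the unnormalized log posterior, the ArviZ psislw utility is used for the Pareto fit, and the paper's exact diagnostic procedure may differ.

```python
import arviz as az
import torch

def psis_khat(flow, log_target, num_samples=1000):
    """Draw samples from the flow, compute importance log-ratios against the target,
    and return the Pareto shape estimate k-hat (values well below 0.7 are commonly
    taken to indicate a reasonable approximation)."""
    with torch.no_grad():
        xs, log_q = flow.sample_and_log_prob(num_samples)
    log_p = log_target(xs)  # unnormalized target log-density at flow samples (expensive)
    log_ratios = (log_p - log_q).cpu().numpy()
    _, khat = az.psislw(log_ratios)
    return float(khat)
```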

In summary, NFR offers a promising offline approach to Bayesian inference for computationally demanding models, bridging the gap between point estimation and full posterior approximation by effectively recycling existing model evaluations within a normalizing flow regression framework.
