Pareto Smoothed Importance Sampling (PSIS)
- Pareto Smoothed Importance Sampling (PSIS) is a framework that stabilizes importance sampling estimators by regularizing extreme weights via a generalized Pareto distribution.
- It replaces the largest raw weights with smoothed estimates to reduce variance and improve reliability, making it effective for LOO-CV in Bayesian workflows.
- PSIS provides a diagnostic through the estimated Pareto shape parameter, guiding model selection and flagging potential reliability issues in high-dimensional models.
Pareto Smoothed Importance Sampling (PSIS) is a framework for stabilizing importance sampling estimators, particularly in Bayesian computation and model evaluation contexts such as leave-one-out cross-validation (LOO-CV). PSIS addresses the high variance and instability associated with heavy-tailed importance weights by regularizing the extreme weights through a fit to the generalized Pareto distribution (GPD), simultaneously providing a theoretically grounded estimator and a reliability diagnostic via the estimated shape parameter of the GPD fit. This approach is widely adopted in the modern Bayesian workflow, including model selection, posterior predictive checking, and diagnostics in high-dimensional and hierarchical models, and is integral to robust and efficient LOO-CV methodologies (Vehtari et al., 2015, Jiang et al., 2020, Yao et al., 2018).
1. Foundations of Importance Sampling and the High-Variance Problem
Importance sampling (IS) enables expectation estimation under a target distribution $p(\theta)$ when only draws from a proposal distribution $g(\theta)$ are accessible:

$$\mathbb{E}_p[h(\theta)] \approx \frac{\sum_{s=1}^{S} w_s\, h(\theta_s)}{\sum_{s=1}^{S} w_s}, \qquad w_s = \frac{p(\theta_s)}{g(\theta_s)}, \quad \theta_s \sim g.$$

However, when the weights $w_s$ have a heavy right tail—typically due to discrepancies between $g$ and $p$, especially in the tails—the estimator becomes dominated by a few extreme terms, resulting in very large or infinite variance. This phenomenon is particularly severe in high-dimensional and complex models, where standard diagnostics can be misleading and Monte Carlo standard error estimation fails (Vehtari et al., 2015, Yao et al., 2018).
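The failure mode described above is easy to reproduce. The sketch below (a hypothetical setup, not from the source) uses a proposal that is narrower than the target, so the weight ratio has a heavy right tail and a handful of draws dominate the self-normalized estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: target p = N(0, 5^2) is wider than the proposal
# g = N(0, 1), so the weight ratio p/g has a heavy right tail.
S = 10_000
theta = rng.normal(0.0, 1.0, size=S)      # draws from the proposal g

log_p = -0.5 * (theta / 5.0) ** 2         # log target density, up to a constant
log_g = -0.5 * theta ** 2                 # log proposal density, up to a constant

log_w = log_p - log_g
w = np.exp(log_w - log_w.max())           # raw weights, stabilized against overflow

# Self-normalized IS estimate of E_p[theta] (true value: 0)
estimate = np.sum(w * theta) / np.sum(w)

# Symptom of heavy tails: a single draw can carry a large share of the
# total weight mass, inflating the estimator's variance.
max_weight_share = w.max() / w.sum()
print(estimate, max_weight_share)
```

Subtracting the maximum log-weight before exponentiating is a standard numerical safeguard; it cancels in the self-normalized ratio.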
2. Pareto Smoothing: GPD Modeling of the Weight Tail
PSIS addresses the instability by substituting a portion of the largest raw weights with expected order statistics from a GPD fit. Let $w_1, \dots, w_S$ be the unsorted raw weights. Define the tail size $M = \lceil 0.2\,S \rceil$ (or, in alternative formulations, $M = \lceil 3\sqrt{S} \rceil$). The tail threshold $u$ is selected as the $(S-M)$-th order statistic among the sorted weights. The excesses above $u$:

$$y_m = w_{(S-M+m)} - u, \qquad m = 1, \dots, M,$$

are fit with a two-parameter GPD:

$$p(y \mid \sigma, k) = \frac{1}{\sigma}\left(1 + \frac{k y}{\sigma}\right)^{-1/k - 1}.$$

Maximum likelihood or empirical Bayes (Zhang & Stephens, 2009) estimation yields the tail shape $\hat{k}$ and scale $\hat{\sigma}$ (Vehtari et al., 2015, Yao et al., 2018). The $m$-th largest weight is replaced by its GPD-expected order statistic:

$$\tilde{w}_{(S-M+m)} = u + \frac{\hat{\sigma}}{\hat{k}}\left[\left(1 - \frac{m - 1/2}{M}\right)^{-\hat{k}} - 1\right], \qquad m = 1, \dots, M.$$

Weights below the threshold remain unchanged. All smoothed weights are optionally truncated at $S^{3/4}\bar{w}$ (where $\bar{w}$ denotes the mean smoothed weight) to guarantee finite variance (Vehtari et al., 2015).
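The smoothing step can be sketched as follows. This is a simplified illustration, not the reference implementation: it uses scipy's maximum-likelihood GPD fit where the paper recommends the Zhang & Stephens empirical-Bayes fit, and the 0.2·S tail-size variant:

```python
import numpy as np
from scipy.stats import genpareto

def psis_smooth(w, tail_frac=0.2):
    """Sketch of Pareto smoothing for a 1-D array of raw weights `w`."""
    w = np.asarray(w, dtype=float)
    S = len(w)
    M = int(np.ceil(tail_frac * S))           # tail size (0.2*S variant)
    order = np.argsort(w)
    u = w[order[S - M - 1]]                   # threshold: (S-M)-th order statistic
    tail_idx = order[S - M:]                  # indices of the M largest weights
    excess = w[tail_idx] - u                  # excesses above the threshold

    # ML fit of the two-parameter GPD (location fixed at 0)
    k, _, sigma = genpareto.fit(excess, floc=0)

    # Replace the M largest weights with GPD quantiles at (m - 1/2)/M
    q = (np.arange(1, M + 1) - 0.5) / M
    smoothed = np.copy(w)
    smoothed[tail_idx] = u + genpareto.ppf(q, k, loc=0, scale=sigma)

    # Optional truncation at S^{3/4} * mean weight guarantees finite variance
    smoothed = np.minimum(smoothed, S ** 0.75 * smoothed.mean())
    return smoothed, k

rng = np.random.default_rng(1)
w = rng.pareto(2.0, size=2000) + 1.0          # heavy-tailed synthetic weights
w_s, k_hat = psis_smooth(w)
print(k_hat)
```

Because `tail_idx` lists the tail weights in ascending order, the ascending GPD quantiles preserve the ordering of the replaced weights.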
3. The Pareto Shape Parameter and Diagnostic Interpretation
The core diagnostic in PSIS is the fitted Pareto shape parameter $\hat{k}$, which quantifies the relative heaviness of the weight distribution tail. This parameter determines the existence of moments and the validity of asymptotic theorems:
- $\hat{k} < 1/2$: finite weight variance, the standard CLT applies, convergence rate $O(S^{-1/2})$
- $1/2 \le \hat{k} < 1$: infinite variance but finite mean; the estimator converges slowly to a stable law
- $\hat{k} \ge 1$: the mean does not exist and the estimator is unreliable (Vehtari et al., 2015, Jiang et al., 2020, Yao et al., 2018)
A practical warning threshold is $\hat{k} > 0.7$, above which the PSIS estimate is flagged as unreliable for the corresponding data point or observation. Sample-size-dependent rules such as $\hat{k} < \min(1 - 1/\log_{10} S,\; 0.7)$ may also be employed (Vehtari et al., 2015).
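The moment-existence regimes and the warning threshold translate directly into a small helper. The function names are hypothetical, and the sample-size-dependent rule follows the $\min(1 - 1/\log_{10} S,\ 0.7)$ form, which is one published variant:

```python
import math

def khat_category(k_hat):
    """Map a fitted Pareto shape k-hat to a diagnostic category."""
    if k_hat < 0.5:
        return "good"       # finite variance, standard CLT rate
    if k_hat < 0.7:
        return "ok"         # finite mean; practically still usable
    if k_hat < 1.0:
        return "bad"        # infinite variance, slow convergence
    return "very bad"       # mean does not exist

def khat_threshold(S):
    """Sample-size-dependent warning threshold, capped at 0.7."""
    return min(1.0 - 1.0 / math.log10(S), 0.7)

print(khat_category(0.3), khat_category(0.85), khat_threshold(4000))
```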
4. PSIS Algorithmic Steps for Leave-One-Out Cross-Validation
The PSIS-LOO procedure is used in Bayesian model evaluation to estimate predictive accuracy efficiently:
- Compute raw importance weights for each observation $i$ and posterior draw $s$: $w_i^{(s)} = 1 / p(y_i \mid \theta^{(s)})$.
- Sort the weights and select the top $M$ for GPD fitting: $M = \lceil 0.2\,S \rceil$, with threshold $u$ equal to the $(S-M)$-th order statistic.
- Fit a GPD to the excesses $w_i^{(s)} - u$.
- Replace the top $M$ weights with their smoothed GPD quantiles.
- Truncate and normalize all weights.
- Estimate LOO predictive densities:

$$\widehat{p}(y_i \mid y_{-i}) = \frac{\sum_{s=1}^{S} \tilde{w}_i^{(s)}\, p(y_i \mid \theta^{(s)})}{\sum_{s=1}^{S} \tilde{w}_i^{(s)}}$$

- Aggregate the expected log predictive density across all data points:

$$\widehat{\mathrm{elpd}}_{\mathrm{loo}} = \sum_{i=1}^{n} \log \widehat{p}(y_i \mid y_{-i})$$

- Compute the diagnostic $\hat{k}_i$ for each $i$; flag those with $\hat{k}_i > 0.7$ (Vehtari et al., 2015, Jiang et al., 2020).
In cases where a small number of $\hat{k}_i$ exceed the threshold, exact LOO refits for those points are recommended (PSIS-LOO+); for widespread failures, $K$-fold cross-validation may be more appropriate (Vehtari et al., 2015).
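The full PSIS-LOO procedure can be sketched end to end. This is a simplified illustration under stated assumptions: scipy's ML fit stands in for the recommended empirical-Bayes GPD fit, the function and variable names are hypothetical, and the mock posterior is synthetic:

```python
import numpy as np
from scipy.stats import genpareto

def psis_loo(log_lik, tail_frac=0.2):
    """Sketch of PSIS-LOO for an (S, n) array of log p(y_i | theta^(s))."""
    S, n = log_lik.shape
    M = int(np.ceil(tail_frac * S))
    elpd_i = np.empty(n)
    k_hats = np.empty(n)
    for i in range(n):
        log_r = -log_lik[:, i]                    # log raw LOO weights
        r = np.exp(log_r - log_r.max())           # stabilized raw weights
        order = np.argsort(r)
        u = r[order[S - M - 1]]                   # tail threshold
        tail = order[S - M:]
        k, _, sigma = genpareto.fit(r[tail] - u, floc=0)
        q = (np.arange(1, M + 1) - 0.5) / M
        r[tail] = u + genpareto.ppf(q, k, loc=0, scale=sigma)
        r = np.minimum(r, S ** 0.75 * r.mean())   # truncate
        w = r / r.sum()                           # normalize
        elpd_i[i] = np.log(np.sum(w * np.exp(log_lik[:, i])))
        k_hats[i] = k
    return elpd_i.sum(), elpd_i, k_hats

# Synthetic example: normal likelihood with mock posterior draws of the mean
rng = np.random.default_rng(2)
y = rng.normal(size=50)
theta = rng.normal(0.0, 0.2, size=(1000, 1))
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - theta) ** 2
elpd, elpd_i, k_hats = psis_loo(log_lik)
print(elpd, (k_hats > 0.7).sum())
```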
5. Model Selection and Diagnostic Visualization
PSIS-LOO enables robust Bayesian model comparison. For each candidate model, $\widehat{\mathrm{elpd}}_{\mathrm{loo}}$ and its standard error are computed. Models are ranked by $\widehat{\mathrm{elpd}}_{\mathrm{loo}}$. If the difference in $\widehat{\mathrm{elpd}}_{\mathrm{loo}}$ between two models is less than one standard error, they are considered indistinguishable up to parsimony preference. The vector of $\hat{k}_i$ provides casewise influence diagnostics; values $\hat{k}_i > 0.7$ warrant further investigation or targeted model refitting (Jiang et al., 2020).
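The difference-and-standard-error rule can be sketched from the pointwise elpd contributions of two models. The helper name and the toy vectors are hypothetical:

```python
import numpy as np

def compare_elpd(elpd_i_a, elpd_i_b):
    """Pairwise elpd difference and its standard error from pointwise values."""
    diff_i = np.asarray(elpd_i_a) - np.asarray(elpd_i_b)
    n = len(diff_i)
    diff = diff_i.sum()
    se = np.sqrt(n * diff_i.var(ddof=1))   # SE of the summed difference
    return diff, se

# Toy pointwise elpd vectors for two hypothetical models
a = np.array([-1.2, -0.8, -1.0, -0.9, -1.1])
b = np.array([-1.3, -0.9, -1.2, -1.0, -1.0])
diff, se = compare_elpd(a, b)
# Models are indistinguishable when |diff| is less than one SE
print(diff > 0, abs(diff) < se)            # → True False
```

Using the paired pointwise differences (rather than differencing two totals) is what makes the standard error meaningful, since the same observations enter both models.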
Recommended diagnostic plots include:
- PSIS $\hat{k}$-diagnostic (scatter plot of $\hat{k}_i$ vs. observation index $i$, colored by reliability thresholds): highlights problematic, high-leverage observations.
- Posterior predictive check plots: overlays of observed data density and densities from replicated datasets generated under the posterior predictive distribution (Jiang et al., 2020).
6. Comparison to Alternative Stabilization Methods
Truncated importance sampling (TIS) and winsorization mitigate heavy-tailed weights by direct thresholding. TIS truncates at $\sqrt{S}\,\bar{w}$, achieving finite variance at the cost of increased bias. Winsorization uses a fixed quantile cutoff. Neither provides a continuous, scale-free diagnostic such as $\hat{k}$. PSIS replaces the top $M$ weights smoothly according to the empirical tail shape, delivering lower root mean square error (RMSE) than TIS and plain IS, lower bias than TIS, and robust MCSE estimation as long as $\hat{k} < 0.7$ (Vehtari et al., 2015).
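For contrast with Pareto smoothing, the TIS baseline is a one-line hard cap. A minimal sketch, assuming the $\sqrt{S}\,\bar{w}$ truncation level described above:

```python
import numpy as np

def truncate_tis(w):
    """Truncated importance sampling: cap raw weights at sqrt(S) * mean(w)."""
    w = np.asarray(w, dtype=float)
    S = len(w)
    return np.minimum(w, np.sqrt(S) * w.mean())

rng = np.random.default_rng(3)
w = rng.pareto(1.5, size=5000) + 1.0      # heavy-tailed synthetic weights
w_t = truncate_tis(w)

# All weights above the cap collapse to a single value; unlike PSIS,
# the tail shape information is discarded and no k-hat diagnostic exists.
print(w_t.max() <= np.sqrt(5000) * w.mean())   # → True
```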
Empirical findings demonstrate that PSIS achieves superior bias-variance tradeoffs and accurate diagnostics in Bayesian linear/logistic regression and hierarchical models—most notably, identifying pathologies in variational approximations and non-centered parametrizations (Yao et al., 2018).
7. Implementation and Practical Usage
PSIS is implemented in standard Bayesian workflow libraries, notably the R package loo, compatible with Stan. The computational overhead is negligible relative to MCMC, as the dominant costs are likelihood evaluations and GPD fits. Standard usage involves extraction of the matrix of log-likelihood values and one-line function calls for PSIS-LOO computations and model comparisons (Vehtari et al., 2015). Monte Carlo error and effective sample size estimates are also provided as part of the PSIS framework:

$$S_{\mathrm{eff}} = \frac{1}{\sum_{s=1}^{S} \left(\tilde{w}^{(s)}\right)^2},$$

where $\tilde{w}^{(s)}$ denote the normalized smoothed weights.
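The importance-sampling effective sample size can be sketched directly from normalized weights (the function name is hypothetical):

```python
import numpy as np

def psis_ess(w):
    """Effective sample size from weights: S_eff = 1 / sum(w_tilde^2),
    where w_tilde are the weights normalized to sum to one."""
    w = np.asarray(w, dtype=float)
    w_norm = w / w.sum()
    return 1.0 / np.sum(w_norm ** 2)

# Equal weights recover the full sample size; one dominant weight
# collapses the ESS toward 1.
print(round(psis_ess(np.ones(100)), 6))                  # → 100.0
print(psis_ess(np.array([100.0] + [1.0] * 99)))
```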
Practical recommendations are to always inspect the distribution of $\hat{k}$ values. If problematic values are few, employ PSIS-LOO+; if many, consider alternative cross-validation strategies (Vehtari et al., 2015). PSIS is routinely employed for posterior predictive checks, model selection, and diagnostics in high-stakes Bayesian inference and applied statistical modeling (Jiang et al., 2020, Vehtari et al., 2015).