Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC (1507.04544v5)

Published 16 Jul 2015 in stat.CO and stat.ME

Abstract: Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparing predictive errors between two models. We implement the computations in an R package called 'loo' and demonstrate using models fit with the Bayesian inference package Stan.

Citations (3,694)

Summary

  • The paper introduces PSIS-LOO, a key advancement that stabilizes predictive estimates using Pareto-smoothed importance sampling.
  • It demonstrates that PSIS-LOO outperforms unsmoothed importance-sampling LOO and WAIC, especially under weak priors or in the presence of influential observations.
  • The authors offer practical tools integrated with R and Stan, facilitating robust and efficient Bayesian model assessment in real-world applications.

Overview of Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC

The paper "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC" by Vehtari, Gelman, and Gabry addresses practical methodologies for evaluating Bayesian models, focusing on leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC). The authors emphasize the need for robust, reliable, and computationally efficient measures of predictive accuracy for Bayesian models, which is crucial for model selection and assessment.

Core Concepts

Both LOO cross-validation and WAIC provide estimates of a model's out-of-sample predictive performance. Traditional methods such as AIC and DIC often fall short in Bayesian settings due to their reliance on point estimates rather than full posterior distributions. The authors present LOO, specifically its approximation using Pareto-smoothed importance sampling (PSIS-LOO), as a computationally feasible method that retains accuracy in various scenarios, particularly when models involve complex structures or weak priors.
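
To make the target quantity concrete: exact LOO refits the model n times, each time holding out one observation and evaluating its log predictive density. Below is a minimal sketch for a toy normal-mean model with known standard deviation and a flat prior; the data and variable names are illustrative, not from the paper.

    set.seed(1)
    y <- rnorm(20, mean = 1, sd = 2)          # simulated data for illustration
    n <- length(y); sigma <- 2                # known sd; flat prior on the mean
    loo_lpd <- numeric(n)
    for (i in 1:n) {
      post_mean <- mean(y[-i])                # posterior mean of mu given y[-i]
      post_var  <- sigma^2 / (n - 1)          # posterior variance under a flat prior
      # log p(y_i | y_{-i}): predictive variance adds the observation noise
      loo_lpd[i] <- dnorm(y[i], post_mean, sqrt(post_var + sigma^2), log = TRUE)
    }
    elpd_loo_exact <- sum(loo_lpd)            # the quantity PSIS-LOO approximates

For conjugate toy models this brute-force loop is cheap, but for a model requiring MCMC it means n full refits; avoiding that cost is exactly what PSIS-LOO is for.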

PSIS-LOO refines plain importance-sampling LOO by stabilizing the importance weights: the largest weights for each held-out observation are replaced with order statistics of a generalized Pareto distribution fit to the upper tail, making LOO cross-validation both practical and robust. The estimated shape parameter of that Pareto fit doubles as a diagnostic, flagging observations for which importance sampling is unreliable and for which direct computation (refitting the model without the problematic point) is more appropriate.
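
A sketch of that computation using the exported psis() helper from the loo package, continuing the toy model above but now approximating LOO from a single set of posterior draws. The analytic draws stand in for MCMC output, and the variable names are illustrative.

    library(loo)
    set.seed(1)
    y <- rnorm(20, mean = 1, sd = 2); n <- length(y); sigma <- 2
    S <- 4000
    mu_draws <- rnorm(S, mean(y), sigma / sqrt(n))   # draws from the posterior of mu
    log_lik  <- sapply(y, function(yi) dnorm(yi, mu_draws, sigma, log = TRUE))  # S x n
    # Raw log importance ratios for LOO are -log_lik; psis() Pareto-smooths the tails.
    psis_fit <- psis(-log_lik, r_eff = rep(1, n))    # r_eff = 1: independent draws
    lw <- weights(psis_fit)                          # smoothed, self-normalized log weights
    elpd_loo_psis <- sum(log(colSums(exp(lw + log_lik))))
    psis_fit$diagnostics$pareto_k                    # k-hat > 0.7 flags unreliable points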

WAIC likewise uses the full posterior and is asymptotically equivalent to LOO, but its finite-sample behavior can be worse: its bias grows in models with weak priors or influential observations. The authors therefore advocate PSIS-LOO as the more reliable choice in finite-sample contexts across a range of model configurations.
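
The WAIC computation uses the same S x n log-likelihood matrix as the sketch above; the standard formulas, which loo::waic() wraps, are short enough to state directly:

    # Pointwise log predictive density, computed stably via log-sum-exp
    lpd <- apply(log_lik, 2, function(l) { m <- max(l); m + log(mean(exp(l - m))) })
    p_waic <- apply(log_lik, 2, var)   # effective-complexity penalty: posterior variance
    elpd_waic <- sum(lpd - p_waic)     # asymptotically equal to elpd_loo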

Results and Implications

Numerical experiments on both simulated and real-world datasets illustrate the efficacy of PSIS-LOO. The authors demonstrate that PSIS-LOO outperforms both unsmoothed importance-sampling LOO and WAIC, particularly in situations with high predictive variance or influential observations, where WAIC's approximation can fail.

Moreover, the methodology described in the paper is implemented in an R package, loo, designed to work with models fit using the Stan Bayesian inference software. This integration makes the methods easy to apply in practice, facilitating broader adoption among researchers and practitioners who rely on Bayesian approaches.
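
In practice the workflow looks like the following, assuming the Stan program stores pointwise log-likelihoods in a generated quantity named log_lik; fit1 and fit2 are hypothetical stanfit objects for two candidate models, and loo_compare() is the comparison function in current versions of the package.

    library(loo)
    log_lik1 <- extract_log_lik(fit1, parameter_name = "log_lik", merge_chains = FALSE)
    r_eff1   <- relative_eff(exp(log_lik1))       # adjusts for MCMC autocorrelation
    loo1     <- loo(log_lik1, r_eff = r_eff1)
    print(loo1)                                   # elpd_loo, p_loo, Pareto k diagnostics

    log_lik2 <- extract_log_lik(fit2, parameter_name = "log_lik", merge_chains = FALSE)
    loo2     <- loo(log_lik2, r_eff = relative_eff(exp(log_lik2)))
    loo_compare(loo1, loo2)                       # elpd difference and its standard error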

Implications for Bayesian Practices and Tools

The development of practical tools like PSIS-LOO marks a significant enhancement in Bayesian model comparison and selection. The authors’ approach highlights the potential for scalability and improvement in computational methodologies, mitigating performance issues in larger data sets or more complex hierarchical models.

Future developments could explore the extension of these methodologies towards even more computationally demanding models, refine diagnostics for importance sampling degradation, and incorporate further advancements in software frameworks to enable automatic and more efficient model evaluation.

Conclusion

Vehtari, Gelman, and Gabry systematically address the challenges of Bayesian model evaluation, offering a balanced critique of traditional methods and providing methodological improvements that significantly enhance predictive accuracy evaluations in practice. PSIS-LOO emerges as a key advancement, offering both robustness and computational efficiency, thus contributing profoundly to the field of Bayesian statistics. As data complexity and model intricacies continue to grow, the demand for such improvements will undoubtedly increase, guiding future innovations in AI and computational statistics.
