
The Rashomon Importance Distribution: Getting RID of Unstable, Single Model-based Variable Importance (2309.13775v4)

Published 24 Sep 2023 in cs.LG, q-bio.GN, and stat.ML

Abstract: Quantifying variable importance is essential for answering high-stakes questions in fields like genetics, public policy, and medicine. Current methods generally calculate variable importance for a given model trained on a given dataset. However, for a given dataset, there may be many models that explain the target outcome equally well; without accounting for all possible explanations, different researchers may arrive at many conflicting yet equally valid conclusions given the same data. Additionally, even when accounting for all possible explanations for a given dataset, these insights may not generalize because not all good explanations are stable across reasonable data perturbations. We propose a new variable importance framework that quantifies the importance of a variable across the set of all good models and is stable across the data distribution. Our framework is extremely flexible and can be integrated with most existing model classes and global variable importance metrics. We demonstrate through experiments that our framework recovers variable importance rankings for complex simulation setups where other methods fail. Further, we show that our framework accurately estimates the true importance of a variable for the underlying data distribution. We provide theoretical guarantees on the consistency and finite sample error rates for our estimator. Finally, we demonstrate its utility with a real-world case study exploring which genes are important for predicting HIV load in persons with HIV, highlighting an important gene that has not previously been studied in connection with HIV. Code is available at https://github.com/jdonnelly36/Rashomon_Importance_Distribution.
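The abstract's core idea, aggregating a variable-importance metric over the set of all near-optimal ("good") models under data perturbations rather than trusting a single fitted model, can be sketched in a few lines. The sketch below is purely illustrative and is not the paper's actual algorithm: the model class (linear least squares), the Rashomon threshold `eps`, and the use of bootstrap resamples as the perturbation scheme are all assumptions for the sake of a runnable toy example.

```python
# Hypothetical sketch of the Rashomon-set idea from the abstract: collect a
# variable-importance metric across many near-optimal models fit on bootstrap
# resamples, yielding a *distribution* of importances rather than one number.
# Model class, threshold, and metric here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def fit_least_squares(X, y):
    # Stand-in model class: ordinary least squares (no intercept).
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def loss(coef, X, y):
    return np.mean((X @ coef - y) ** 2)

def permutation_importance(coef, X, y, j, rng):
    # Increase in loss when feature j is permuted (a global VI metric).
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return loss(coef, Xp, y) - loss(coef, X, y)

def rashomon_importance(X, y, j, n_boot=50, eps=1.05, rng=rng):
    """Importances of feature j across all 'good' bootstrap models, i.e.
    models whose loss is within a factor eps of the best observed loss."""
    n = len(y)
    models, losses = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # bootstrap resample
        coef = fit_least_squares(X[idx], y[idx])
        models.append(coef)
        losses.append(loss(coef, X, y))
    best = min(losses)
    # Rashomon set: keep only the near-optimal models.
    good = [m for m, l in zip(models, losses) if l <= eps * best]
    return np.array([permutation_importance(m, X, y, j, rng) for m in good])

# Toy data: the outcome depends on feature 0 but not on feature 1.
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)
imp0 = rashomon_importance(X, y, 0)
imp1 = rashomon_importance(X, y, 1)
```

Because `rashomon_importance` returns a whole array of importances, one per good model, it exposes the spread of conclusions the abstract warns about: different equally valid models can assign different importances, and the distribution (rather than a single model's value) is the stable summary.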
