Interpretability Needs a New Paradigm (2405.05386v2)

Published 8 May 2024 in cs.LG, cs.CL, cs.CV, and stat.ML

Abstract: Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in AI, which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents 3 emerging paradigms for interpretability. The first paradigm designs models such that faithfulness can be easily measured. Another optimizes models such that explanations become faithful. The last paradigm proposes to develop models that produce both a prediction and an explanation.


Summary

  • The paper argues that building faithfulness measurement into model design can improve the reliability and clarity of AI explanations.
  • It critically examines the intrinsic and post-hoc paradigms, emphasizing the risks of unfaithful yet plausible explanations.
  • By outlining emerging approaches such as faithfulness measurable models (FMMs) and self-explaining models, the paper offers actionable directions for developing more transparent and accountable AI systems.

Exploring New Paradigms in Model Interpretability

Introduction to Interpretability Paradigms

Interpretability in ML refers to our ability to decipher, in simple human terms, why and how a model makes certain decisions. Traditionally, interpretability has been segmented into two dominant paradigms: the intrinsic and post-hoc approaches.

  • Intrinsic paradigm: This viewpoint holds that only models designed to be interpretable can be explained: clear, understandable decision processes must be built into the model's architecture itself. Classic examples include decision trees and linear models, where the reasoning is straightforward and visible in the model's structure.
  • Post-hoc paradigm: This perspective asserts that explanations can be derived from complex models (often called "black-box" models because of their opaque inner workings) after they have been trained. Techniques such as feature-importance or saliency methods are applied to the trained model to interpret its predictions; a minimal example is sketched below.
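
As a concrete illustration of the post-hoc style, here is a minimal sketch of input-gradient saliency for an arbitrary, already trained PyTorch classifier. The toy model and random input are placeholders rather than anything from the paper; the point is only that the explanation is produced after training, without any access to how the model was built.

```python
# Post-hoc input-gradient saliency for a generic, already trained classifier.
# The toy model and random input are placeholders, not the paper's setup.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3)).eval()

x = torch.randn(1, 4, requires_grad=True)   # one input with 4 features
logits = model(x)
pred = logits.argmax(dim=-1).item()

# Gradient of the predicted logit w.r.t. the input: a large |gradient| is read
# as "this feature mattered" -- a post-hoc claim that may or may not be
# faithful to the model's actual computation.
logits[0, pred].backward()
saliency = x.grad.abs().squeeze(0)
print(pred, saliency.tolist())
```

Whether such a saliency map is faithful to what the model actually computed is precisely the question the next section turns to.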

Both paradigms have merits but also significant limitations, leading researchers to propose and evaluate new paradigms that might better address these flaws.

Limitations of Current Paradigms

The existing paradigms often fall short in terms of faithfulness, a term used to describe how accurately an explanation represents the operations and decisions of a model. Unfaithful explanations can be misleading, potentially causing more harm than good by engendering false confidence in the decisions made by AI systems.

  • Intrinsic models: Though they provide a direct route to interpretability, they can be limited in performance and flexibility. Additionally, parts of even nominally interpretable models can remain opaque, such as neural-network layers that sit outside the components designed to be interpretable.
  • Post-hoc explanations: These are broadly applicable and useful, especially for complex models, but often at the cost of accuracy in the interpretation. They may fail to capture the true causal relationships behind a model's decisions, leading to potentially misleading interpretations. In practice, faithfulness is therefore probed empirically, for example with erasure-style checks like the one sketched after this list.
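
As a rough illustration of how faithfulness is commonly probed, the sketch below implements an erasure-style check under simplified assumptions (in the spirit of deletion/ROAR-type metrics, not the paper's exact protocol): if an explanation is faithful, removing the features it ranks highest should change the prediction more than removing randomly chosen features.

```python
# Erasure-style faithfulness check (illustrative sketch; the toy model, random
# data, and zeroing-as-removal are all assumptions, not the paper's setup).
import torch
import torch.nn as nn

def prob_drop(model, x, scores, k, baseline=0.0):
    """Drop in predicted-class probability after zeroing the k highest-scored
    features. A faithful explanation should produce a larger drop than a
    random ranking does."""
    with torch.no_grad():
        p_full = torch.softmax(model(x), dim=-1)
        cls = p_full.argmax(dim=-1).item()
        x_masked = x.clone()
        x_masked[0, scores.topk(k).indices] = baseline
        p_masked = torch.softmax(model(x_masked), dim=-1)
        return (p_full[0, cls] - p_masked[0, cls]).item()

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)).eval()
x = torch.randn(1, 8)
attribution = torch.rand(8)      # stand-in explanation scores
random_ranking = torch.rand(8)   # random baseline for comparison
print(prob_drop(model, x, attribution, k=3),
      prob_drop(model, x, random_ranking, k=3))
```

A well-known complication is that "removing" features (here, zeroing them) pushes the input outside the distribution the model was trained on, which can distort the measurement; this is part of what motivates the faithfulness measurable models discussed next.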

Emerging Paradigms in Interpretability

Responding to the deficiencies in traditional paradigms, researchers have begun to sketch out potential new frameworks that can offer both high performance and faithful explanations:

  1. Inherently Faithfulness Measurable Models (FMMs):
    • These models are not designed to be inherently interpretable themselves; instead, they are designed so that the faithfulness of any explanation of their behavior can be measured easily and accurately.
    • A demonstrated approach modifies specific model types, such as RoBERTa, so that masked inputs remain in-distribution, accommodating direct and reliable faithfulness assessments without additional training or computational cost at evaluation time; a minimal sketch of this measurement loop follows the list.
  2. Models That Learn to Explain Faithfully:
    • Unlike traditional post-hoc methods, this paradigm focuses on optimizing models so that they naturally generate more faithful explanations.
    • This can involve novel training objectives or architectural adjustments that explicitly reward explanation faithfulness during training.
  3. Self-explaining Models:
    • This concept pushes the idea further: a model should produce not only a prediction but also its own explanation as part of its output.
    • These models hold the potential for deep integration of interpretability, though ensuring the faithfulness of their self-generated explanations remains a critical challenge.
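
To make the first emerging paradigm concrete, the sketch below shows what an FMM-style faithfulness measurement loop might look like under assumed tooling (Hugging Face transformers and a RoBERTa checkpoint). In the actual paradigm the classifier would be fine-tuned so that masked inputs are in-distribution; the generic "roberta-base" checkpoint used here has an untrained classification head and no such fine-tuning, so it only illustrates the mechanics of the measurement, not the paper's trained setup.

```python
# FMM-style measurement loop (illustrative; the model name, example text, and
# flagged token positions are assumptions, not taken from the paper).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-base"  # in practice: a classifier fine-tuned with masking support
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2).eval()

def masked_prob_drop(text, flagged_positions):
    """Drop in predicted-class probability after replacing the flagged token
    positions with <mask>. If masked inputs are in-distribution for the model,
    a larger drop indicates a more faithful explanation."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        p = torch.softmax(model(**enc).logits, dim=-1)
        cls = p.argmax(dim=-1).item()
        masked = {k: v.clone() for k, v in enc.items()}
        masked["input_ids"][0, flagged_positions] = tok.mask_token_id
        p_masked = torch.softmax(model(**masked).logits, dim=-1)
    return (p[0, cls] - p_masked[0, cls]).item()

# Hypothetical usage: an attribution method claims tokens 2 and 3 matter most.
print(masked_prob_drop("the movie was surprisingly good", [2, 3]))
```

The design point is that, because the model is built to tolerate masking, a score returned by such a loop can be read as a faithfulness measurement rather than as an artifact of out-of-distribution inputs.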

Future Directions and Caution

While these emerging paradigms show promise, they also introduce new complexities and risks. Ensuring the faithfulness of explanations remains paramount, as unfaithful but plausible explanations could lead to misguided trust in AI systems. Furthermore, the definition and measurement of faithfulness need to be precise and standardized to prevent inconsistencies and preserve the integrity of interpretability research.

Conclusion

The field of AI interpretability is at a crossroads, with significant opportunities for innovation in how we make complex models understandable and accountable. By exploring and developing new paradigms, we can hope to achieve models that are not only performant but also transparent and trustworthy in their decision-making processes. This exploration, while challenging, is crucial for the safe and ethical advancement of AI technologies.
