Interpretability Needs a New Paradigm (2405.05386v2)
Abstract: Interpretability is the study of explaining models in terms understandable to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which holds that only models designed to be explained can be explained, and the post-hoc paradigm, which holds that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This matters because false but convincing explanations lead to unsupported confidence in AI, which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents three emerging paradigms for interpretability. The first designs models such that faithfulness can be easily measured. The second optimizes models such that explanations become faithful. The last proposes developing models that produce both a prediction and an explanation.
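The notion of faithfulness ("true to the model's behavior") can be made concrete with a simple occlusion-style check: mask the input features an explanation marks as important and verify that the model's prediction changes more than when masking randomly chosen features. The sketch below is a minimal illustration of that idea, not the paper's method; the `predict` interface, the `faithfulness_gap` helper, and the toy linear model are all illustrative assumptions.

```python
# Illustrative occlusion-style faithfulness check (a sketch, not the paper's method):
# mask the top-k features an explanation deems important and compare the resulting
# prediction drop against masking k random features.
import numpy as np


def prediction_drop(predict, x, feature_idx, mask_value=0.0):
    """Drop in the predicted score after masking the given features."""
    x_masked = x.copy()
    x_masked[feature_idx] = mask_value
    return predict(x[None, :])[0] - predict(x_masked[None, :])[0]


def faithfulness_gap(predict, x, importance, k=3, mask_value=0.0, seed=0):
    """Positive gap: masking 'important' features hurts the prediction more than
    masking random ones, which is (weak) evidence the explanation is faithful."""
    rng = np.random.default_rng(seed)
    top_k = np.argsort(importance)[-k:]
    rand_k = rng.choice(len(x), size=k, replace=False)
    return (prediction_drop(predict, x, top_k, mask_value)
            - prediction_drop(predict, x, rand_k, mask_value))


if __name__ == "__main__":
    # Toy linear "model" and a gradient-style explanation (its absolute weights).
    w = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
    predict = lambda X: X @ w
    x = np.ones(5)
    importance = np.abs(w)  # a perfectly faithful explanation for this model
    print(faithfulness_gap(predict, x, importance, k=2))  # usually > 0
```

A positive gap is only weak evidence: occlusion perturbs inputs off the training distribution, which is one reason the paper argues for models built so that faithfulness can be measured or optimized directly.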