Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG) (2409.01610v1)
Abstract: In eXplainable AI (XAI) for LLMs, the progression from local explanations of individual decisions to global explanations built on high-level concepts has laid the groundwork for mechanistic interpretability, which aims to decode a model's exact internal operations. This paradigm, however, remains underexplored in image models, where existing methods focus primarily on class-specific interpretations. This paper introduces an approach that systematically traces the entire pathway from the input, through every intermediate layer, to the final output across the whole dataset. We use Pointwise Feature Vectors (PFVs) and Effective Receptive Fields (ERFs) to decompose model embeddings into interpretable Concept Vectors, and then compute the relevance between concept vectors with our Generalized Integrated Gradients (GIG), enabling a comprehensive, dataset-wide analysis of model behavior. We validate our concept extraction and concept attribution methods with both qualitative and quantitative evaluations. Our approach advances the understanding of semantic significance within image models, offering a holistic view of their operational mechanics.
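The pipeline sketched in the abstract (per-position feature vectors gathered dataset-wide, clustered into concept vectors, then attributed with an integrated-gradients-style path integral) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the helper names (`extract_pfvs`, `concept_relevance`), the choice of `resnet50` and its `layer3`, the k-means clustering step, and the simplified input-to-concept attribution (plain integrated gradients rather than the paper's generalized formulation between concept vectors) are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code): extract pointwise feature vectors (PFVs)
# from an intermediate CNN layer, cluster them dataset-wide into concept vectors,
# and score concept relevance with an integrated-gradients-style path integral.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()  # any pretrained CNN would do

def extract_pfvs(images, layer):
    """Return per-position feature vectors of shape (N*H*W, C) from one layer."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    with torch.no_grad():
        model(images)
    handle.remove()
    a = feats["a"]                                   # (N, C, H, W)
    return a.permute(0, 2, 3, 1).reshape(-1, a.shape[1])

# 1) Dataset-wide concept discovery: cluster PFVs into K concept vectors.
images = torch.randn(8, 3, 224, 224)                 # stand-in for a real data batch
pfvs = extract_pfvs(images, model.layer3)
concepts = KMeans(n_clusters=10, n_init=10).fit(pfvs.numpy()).cluster_centers_

# 2) Integrated-gradients-style relevance of the input to one concept's activation.
def concept_relevance(x, concept, layer, steps=32):
    """Approximate IG of a concept-similarity score w.r.t. the input image."""
    c = torch.tensor(concept, dtype=torch.float32)
    baseline = torch.zeros_like(x)                    # black-image baseline
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (baseline + alpha * (x - baseline)).requires_grad_(True)
        feats = {}
        h = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
        model(xi)
        h.remove()
        a = feats["a"].permute(0, 2, 3, 1)            # (N, H, W, C)
        score = F.cosine_similarity(a, c, dim=-1).sum()
        score.backward()
        grads += xi.grad / steps                      # average gradient along the path
    return (x - baseline) * grads                     # attribution map, same shape as x

attr = concept_relevance(images[:1], concepts[0], model.layer3)
```

In this sketch the relevance is taken with respect to input pixels for readability; the paper's GIG instead propagates relevance between concept vectors of different layers, which would replace the input path above with a path in an earlier layer's activation space.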