PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits (2404.06453v1)

Published 9 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, can act polysemantically and encode multiple (unrelated) features, which makes their interpretation difficult. We present a method for disentangling polysemanticity in any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. Evaluating feature visualizations with CLIP shows that our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.
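At a glance, the approach described in the abstract can be summarized as: for a chosen neuron, attribute its activation back to lower-layer neurons (its "circuit") separately for each strongly activating input, then cluster the per-sample attribution vectors; each cluster is treated as one monosemantic "virtual" neuron. The sketch below illustrates this idea under simplifying assumptions only: it uses plain gradient-times-activation attribution in PyTorch and k-means clustering as stand-ins for the relevance-propagation-based circuit identification used in the paper, and the helper names (circuit_attributions, split_into_virtual_neurons) are hypothetical. The authors' actual implementation is in the linked repository.

# Illustrative sketch only, not the authors' reference implementation
# (see https://github.com/maxdreyer/PURE for that). Assumes a ResNet-style
# model, a target (possibly polysemantic) channel in `layer`, and a batch of
# its top-activating images. Attribution here is gradient x activation,
# a stand-in for the relevance propagation used in the paper.

import torch
from sklearn.cluster import KMeans

def circuit_attributions(model, prev_layer, layer, target_channel, images):
    """For each image, attribute the target channel's activation to the
    channels of the preceding layer, giving one attribution vector per image."""
    acts = {}

    def hook(name):
        def fn(module, inp, out):
            acts[name] = out
            out.retain_grad()  # keep gradients of intermediate activations
        return fn

    h1 = prev_layer.register_forward_hook(hook("prev"))
    h2 = layer.register_forward_hook(hook("target"))

    model.zero_grad()
    model(images)
    # Spatially pooled activation of the target channel is the quantity to explain.
    score = acts["target"][:, target_channel].mean(dim=(1, 2)).sum()
    score.backward()

    # Gradient x activation, pooled over spatial locations -> (batch, prev_channels).
    attr = (acts["prev"].grad * acts["prev"]).sum(dim=(2, 3))
    h1.remove(); h2.remove()
    return attr.detach()

def split_into_virtual_neurons(attributions, n_clusters=2):
    """Cluster per-sample circuit attributions; each cluster is treated as one
    monosemantic 'virtual' neuron of the original polysemantic unit."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(attributions.cpu().numpy())
    return km.labels_, km.cluster_centers_

Feature visualizations (for example, the top-activating images within each cluster) can then be computed per virtual neuron rather than per raw neuron, which is the basis of the CLIP-based evaluation mentioned in the abstract.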
