Linear Explanations for Individual Neurons (2405.06855v1)

Published 10 May 2024 in cs.LG and cs.CV

Abstract: In recent years, many methods have been developed to understand the internal workings of neural networks, often by describing the function of individual neurons in the model. However, these methods typically focus only on explaining the very highest activations of a neuron. In this paper we show this is not sufficient, and that the highest activation range is responsible for only a very small percentage of the neuron's causal effect. In addition, inputs causing lower activations are often very different and cannot be reliably predicted by looking only at high activations. We propose that neurons should instead be understood as a linear combination of concepts, and develop an efficient method for producing these linear explanations. In addition, we show how to automatically evaluate description quality using simulation, i.e., predicting neuron activations on unseen inputs in the vision setting.
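The abstract's central idea, approximating a neuron as a linear combination of concepts and scoring the explanation by simulating the neuron on unseen inputs, can be illustrated with a small sketch. The snippet below is not the paper's implementation: the concept activation matrix (e.g., per-image concept scores), the plain least-squares fit, the top-k truncation, and the Pearson-correlation simulation score are all illustrative assumptions, and the names `fit_linear_explanation` and `simulate_and_score` are hypothetical.

```python
# Minimal sketch, assuming a precomputed concept activation matrix
# (n_samples x n_concepts) and a target neuron's activations per sample.
# Fits the neuron as a sparse linear combination of concepts, then
# "simulates" it on held-out inputs and reports the correlation.

import numpy as np


def fit_linear_explanation(concept_acts, neuron_acts, top_k=5):
    """Fit neuron_acts ~= concept_acts @ w and keep the top_k largest weights."""
    w, *_ = np.linalg.lstsq(concept_acts, neuron_acts, rcond=None)
    # Keep only the strongest concepts so the explanation stays readable.
    keep = np.argsort(np.abs(w))[-top_k:]
    sparse_w = np.zeros_like(w)
    sparse_w[keep] = w[keep]
    return sparse_w


def simulate_and_score(sparse_w, concept_acts_test, neuron_acts_test):
    """Predict the neuron on unseen inputs and return the Pearson correlation."""
    predicted = concept_acts_test @ sparse_w
    return np.corrcoef(predicted, neuron_acts_test)[0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_train, n_test, n_concepts = 500, 200, 50
    # Synthetic data: the "neuron" truly is a linear mix of 3 concepts plus noise.
    C_train = rng.normal(size=(n_train, n_concepts))
    C_test = rng.normal(size=(n_test, n_concepts))
    true_w = np.zeros(n_concepts)
    true_w[[3, 17, 42]] = [1.5, -0.8, 0.6]
    a_train = C_train @ true_w + 0.1 * rng.normal(size=n_train)
    a_test = C_test @ true_w + 0.1 * rng.normal(size=n_test)

    w_hat = fit_linear_explanation(C_train, a_train, top_k=5)
    print("simulation correlation:", simulate_and_score(w_hat, C_test, a_test))
```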

Authors (2)
  1. Tuomas Oikarinen (14 papers)
  2. Tsui-Wei Weng (51 papers)
Citations (2)
