Eliciting Latent Predictions from Transformers with the Tuned Lens (2303.08112v4)
Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the *tuned lens*, is a refinement of the earlier "logit lens" technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable, and unbiased than the logit lens. Using causal experiments, we show that the tuned lens relies on features similar to those used by the model itself. We also find that the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
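To make the method concrete, below is a minimal sketch of the logit lens / tuned lens distinction in PyTorch, assuming a GPT-2-style HuggingFace model. The names `TunedLensProbe` and `decode_hidden` are illustrative, not the authors' API, and the probes here are left at their identity initialization (at which point the tuned lens coincides with the logit lens); in the paper they are trained to minimize the KL divergence from the model's final-layer distribution. See the linked repository for the actual implementation.

```python
# Minimal sketch: decoding intermediate hidden states into vocabulary
# distributions, assuming a GPT-2-style HuggingFace model.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any autoregressive LM that exposes hidden states
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

d_model = model.config.hidden_size


class TunedLensProbe(nn.Module):
    """One affine 'translator' per block: h -> h + A h + b.

    Initialized to the identity (A = 0, b = 0), so an untrained probe
    reproduces the logit lens exactly.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.affine = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.affine.weight)
        nn.init.zeros_(self.affine.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.affine(h)


def decode_hidden(h: torch.Tensor, probe: TunedLensProbe | None = None) -> torch.Tensor:
    """Map a hidden state to vocabulary logits using the model's own
    final LayerNorm and unembedding (the logit lens); the tuned lens
    first passes the hidden state through a learned affine map.
    `model.transformer.ln_f` is GPT-2-specific."""
    if probe is not None:
        h = probe(h)
    return model.lm_head(model.transformer.ln_f(h))


inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One probe per block; in the paper each probe is trained so that its
# decoded distribution matches the model's final-layer distribution.
probes = [TunedLensProbe(d_model) for _ in out.hidden_states[1:]]

# Inspect the trajectory of latent predictions for the last token.
for layer, (h, probe) in enumerate(zip(out.hidden_states[1:], probes)):
    logits = decode_hidden(h[:, -1], probe)
    top = tokenizer.decode(logits.argmax(-1))
    print(f"layer {layer:2d}: top token = {top!r}")
```

Running this with trained probes would show the prediction sharpening layer by layer, which is the "iterative inference" view the abstract describes; with untrained (identity) probes, the printout is exactly the logit lens trajectory.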