Eliciting Latent Predictions from Transformers with the Tuned Lens (2303.08112v4)

Published 14 Mar 2023 in cs.LG

Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the "tuned lens", is a refinement of the earlier "logit lens" technique, which yielded useful insights but is often brittle. We test our method on various autoregressive LLMs with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.


Summary

  • The paper presents the tuned lens, a method that applies layer-specific affine transformations to improve latent prediction accuracy over the logit lens.
  • It demonstrates causal consistency by revealing influential hidden state directions that mirror final model outputs, enhancing interpretability.
  • The approach effectively detects prompt injection attacks and reduces bias, offering practical benefits for securing and refining large transformer models.

Eliciting Latent Predictions from Transformers with the Tuned Lens: An Overview

The paper "Eliciting Latent Predictions from Transformers with the Tuned Lens" presents a refined method for interpreting the hidden states of transformer models, building upon the earlier "logit lens" technique. The authors introduce the "tuned lens," which decodes each hidden state into a distribution over a predefined vocabulary. This method is aimed at understanding how transformers iteratively refine their predictions layer by layer.

Key Contributions

The tuned lens offers several notable improvements over the logit lens:

  1. Enhanced Predictive Reliability: By training an affine transformation for each layer of a frozen model, the tuned lens achieves greater predictive accuracy than the original logit lens. The improvement holds across autoregressive LLMs with up to 20B parameters.
  2. Causal Consistency: Empirical investigations show that the features which most influence the tuned lens's predictions are also influential for the model's own final predictions. This consistency was probed through causal basis extraction, which identifies the most influential directions within the hidden states.
  3. Detection of Malicious Inputs: One practical implication of the method is the detection of prompt injection attacks. The trajectory of latent predictions across layers provides a signature that enables high-accuracy detection of malicious inputs (see the sketch after this list).
  4. Improved Robustness to Bias: The tuned lens gives a less biased estimate of the model's final predictions than the logit lens, which is particularly useful for studying how the model's beliefs are updated from layer to layer.
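
A rough illustration of how such a detector might be assembled follows. The feature choice (per-layer entropy of the lens distribution, averaged over positions) and the use of scikit-learn's IsolationForest are assumptions made for illustration; the paper's actual features and detector may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def trajectory_features(layer_logits: list[np.ndarray]) -> np.ndarray:
    """Summarize a prompt by the entropy of the lens distribution at each layer,
    averaged over token positions (an illustrative feature choice)."""
    feats = []
    for logits in layer_logits:                     # each array: (seq, vocab)
        shifted = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(shifted)
        probs /= probs.sum(axis=-1, keepdims=True)
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
        feats.append(entropy.mean())
    return np.array(feats)                          # shape: (num_layers,)

def fit_injection_detector(benign_trajectories: list[list[np.ndarray]]) -> IsolationForest:
    """Fit an outlier detector on lens trajectories from known-benign prompts."""
    X = np.stack([trajectory_features(t) for t in benign_trajectories])
    return IsolationForest(random_state=0).fit(X)

def is_suspicious(detector: IsolationForest, trajectory: list[np.ndarray]) -> bool:
    # IsolationForest.predict returns -1 for outliers, +1 for inliers.
    return detector.predict(trajectory_features(trajectory)[None, :])[0] == -1
```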

Empirical Evaluation

The tuned lens method was tested on a suite of transformer models, including GPT-Neo and BLOOM. The results showed that the tuned lens consistently outperformed the logit lens in both perplexity and bias. Despite the added complexity of per-layer probes, the method remains computationally feasible and runs efficiently on large datasets.
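
The perplexity comparison is natural because each probe is fit to reproduce the frozen model's own output distribution. A hedged sketch of that kind of distillation-style objective is below; the shapes, flattening, and use of `F.kl_div` are illustrative assumptions, and the training loop, optimizer, and hyperparameters are not shown.

```python
import torch
import torch.nn.functional as F

def lens_loss(lens_logits: torch.Tensor, final_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence from the model's final-layer distribution to the lens
    distribution, averaged over tokens (illustrative; inputs are (batch, seq, vocab))."""
    target = F.log_softmax(final_logits.detach(), dim=-1).flatten(0, -2)  # frozen teacher
    pred = F.log_softmax(lens_logits, dim=-1).flatten(0, -2)
    return F.kl_div(pred, target, log_target=True, reduction="batchmean")
```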

A detailed exploration of how tuned lenses transfer across layers and model variants adds to the method's robustness. For instance, lenses trained on a base model transferred effectively to fine-tuned variants without additional training, retaining their predictive accuracy.

Implications for Future Research

The implications of this research are substantial for both theoretical and practical applications in AI:

  • Theoretical Insights: The method allows for a deeper understanding of how transformers manage and update predictive distributions internally. It may encourage more research into the iterative inference nature of transformer architectures.
  • Practical Applications: By enabling anomaly detection, particularly for identifying prompt injection attacks, this method offers immediate utility in securing transformer applications.
  • Foundation for Further Studies: The paper's use of causal basis extraction sets a precedent for further exploration of causal interpretability, providing pathways for more robust model interpretation techniques (a toy illustration of this kind of intervention follows below).
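
As a toy illustration of the intervention behind this kind of analysis (the mean-ablation edit and function names below are assumptions, not the paper's exact procedure), one can remove a candidate direction from a hidden state and measure how far the decoded distribution moves; directions that matter for both the tuned lens and the model itself support the causal-consistency claim.

```python
import torch
import torch.nn.functional as F

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor,
                     mean_coeff: float) -> torch.Tensor:
    """Replace the component of `hidden` along `direction` with its dataset mean
    (mean-ablation), leaving the orthogonal complement untouched."""
    direction = direction / direction.norm()
    coeff = hidden @ direction                            # (batch, seq)
    return hidden + (mean_coeff - coeff)[..., None] * direction

def causal_effect(decode, hidden: torch.Tensor, direction: torch.Tensor,
                  mean_coeff: float) -> torch.Tensor:
    """KL(base || edited): how much ablating `direction` shifts the decoded
    distribution. `decode` maps a hidden state to logits, e.g. a tuned-lens
    layer or the remainder of the frozen model run from this layer."""
    base = F.log_softmax(decode(hidden), dim=-1)
    edited = F.log_softmax(decode(ablate_direction(hidden, direction, mean_coeff)), dim=-1)
    return (base.exp() * (base - edited)).sum(dim=-1).mean()
```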

Conclusion and Future Directions

In conclusion, the tuned lens presents a significant refinement of earlier techniques for interpreting transformers' inner workings. Future research may focus on extending these insights to other model architectures and on further improving computational efficiency. The practical applications, particularly in security, underscore the broader relevance of such exploratory techniques for understanding and deploying AI systems effectively.

This contribution is poised to enhance our capacity to interpret complex models and their predictions, informing both the development of new models and the responsible deployment of existing ones in various domains.
