
Does Transformer Interpretability Transfer to RNNs? (2404.05971v1)

Published 9 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.

Authors (3)
  1. Gonçalo Paulo
  2. Thomas Marshall
  3. Nora Belrose

Summary

  • The paper demonstrates that transformer-based interpretability methods, such as CAA and the tuned lens, effectively transfer to RNN architectures.
  • It introduces 'state steering', a CAA variant that acts directly on the compressed recurrent state of RNNs such as Mamba and RWKV, improving control over model responses, and shows that the tuned lens yields steadily decreasing perplexity across layers.
  • Experimental findings reveal that RNNs, utilizing compressed states, can retain and elicit latent predictions, widening the interpretability toolkit for AI models.

Does Transformer Interpretability Transfer to RNNs?

Introduction

The paper by Gonçalo Paulo, Thomas Marshall, and Nora Belrose from EleutherAI explores whether interpretability methods originally designed for transformer models carry over to recurrent neural network (RNN) architectures. They focus on the Mamba and RWKV architectures because these have been shown to match or exceed equal-sized transformers on language modeling perplexity and downstream evaluations. Through a series of experiments, the paper examines whether techniques such as Contrastive Activation Addition (CAA), the tuned lens, and the elicitation of latent predictions and knowledge can be effectively applied to these RNNs.

Architectures Analyzed

The paper concentrates on two main RNN architectures: Mamba and RWKV. Both architectures are designed for efficiency and performance, circumventing the quadratic complexity of the transformer’s self-attention mechanism.

  • Mamba: Incorporates a causal convolution block and a selective state-space model (SSM) that routes information in an input-dependent way, significantly enhancing model expressivity; a toy recurrence sketch follows this list.
  • RWKV (Receptance Weighted Key Value): Alternates time-mix and channel-mix modules. RWKV v5 is noted for its "multi-headed" matrix-valued state, which increases state capacity relative to earlier versions.
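
To make the fixed-size ("compressed") recurrent state concrete, here is a heavily simplified, single-channel toy of a diagonal selective state-space recurrence. This is not Mamba's actual implementation (which adds multi-channel projections, a causal convolution, and gating); all parameter names are illustrative.

```python
import numpy as np

def toy_selective_ssm(x, A, w_B, w_C, w_delta):
    """Toy single-channel, diagonal selective SSM scan (not Mamba's real code).
    The step size and the input/output maps depend on the current input, and
    the recurrent state h keeps a fixed size no matter how long the sequence
    is -- this is the 'compressed state' the paper later exploits.

    x: (seq_len,) scalar inputs
    A: (d_state,) negative diagonal entries
    w_B, w_C: (d_state,) illustrative input/output projections
    w_delta: scalar projection controlling the step size
    """
    h = np.zeros_like(A)                            # fixed-size recurrent state
    outputs = []
    for x_t in x:
        delta = np.log1p(np.exp(w_delta * x_t))     # softplus: input-dependent step
        A_bar = np.exp(delta * A)                   # discretized decay
        B_t = w_B * x_t                             # input-dependent input map
        h = A_bar * h + delta * B_t * x_t           # state update
        outputs.append(np.dot(w_C * x_t, h))        # input-dependent readout
    return np.array(outputs), h
```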

These architectures, available on the HuggingFace Hub, provide a base for investigating the transferability of interpretability methods initially tailored for transformers.
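
As a hedged illustration, pretrained checkpoints of this kind can typically be loaded through the `transformers` AutoModel API; the repository names below are assumptions and may not be the exact checkpoints evaluated in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository names are illustrative; check the HuggingFace Hub for the
# specific Mamba and RWKV-v5 checkpoints used in the paper.
mamba_id = "state-spaces/mamba-2.8b-hf"   # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(mamba_id)
model = AutoModelForCausalLM.from_pretrained(mamba_id)

inputs = tokenizer("Does transformer interpretability transfer to RNNs?",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```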

Interpretability Techniques

The authors delve into three primary interpretability techniques:

  1. Contrastive Activation Addition (CAA): They hypothesize that CAA can be effective for RNNs, particularly because their compressed state may facilitate model steering (a minimal sketch follows this list).
  2. The Tuned Lens: Viewing each layer as incrementally refining the next-token prediction, the authors test whether this lens, originally developed for transformers, also offers insight into how RNNs compute (a probe-training sketch appears below).
  3. 'Quirky' Models: These models, fine-tuned to produce incorrect outputs under specific conditions, serve to probe the extent to which RNNs retain latent knowledge that can be correctly elicited.
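
Below is a minimal sketch of Contrastive Activation Addition (item 1 above), assuming a generic PyTorch language model loaded via HuggingFace transformers; the `model.layers` attribute used as a hook point is an assumption and differs across Mamba, RWKV, and transformer implementations.

```python
import torch

@torch.no_grad()
def caa_steering_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """Contrastive Activation Addition: average the activation difference
    between prompts exhibiting a behaviour (pos) and prompts exhibiting the
    opposite behaviour (neg), at one layer's final token position."""
    def mean_final_activation(prompts):
        acts = []
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            hidden_states = model(ids, output_hidden_states=True).hidden_states
            acts.append(hidden_states[layer_idx][0, -1])  # last-token activation
        return torch.stack(acts).mean(dim=0)
    return mean_final_activation(pos_prompts) - mean_final_activation(neg_prompts)

def add_steering_hook(model, layer_idx, vector, scale=1.0):
    """Register a forward hook that adds the steering vector to one layer's
    output during generation. `model.layers` is an assumed attribute name."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return model.layers[layer_idx].register_forward_hook(hook)
```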

These methods underscore the paper's broader aim to understand if and how the inner workings and behaviors of RNNs can be interpreted vis-a-vis transformers.
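
For the tuned lens (item 2 above), a minimal, hedged training sketch is shown below: one affine translator is fitted per layer so that the translated hidden state, after the final norm and unembedding, matches the model's own next-token distribution. The `ln_f` and `lm_head` attribute names are assumptions that vary across model families.

```python
import torch
import torch.nn.functional as F

def train_tuned_lens_layer(model, dataloader, layer_idx, d_model, lr=1e-3, steps=1000):
    """Fit one affine 'translator' for a single layer so that its hidden state,
    once translated and pushed through the final norm and unembedding, matches
    the model's own next-token distribution (KL objective)."""
    for p in model.parameters():
        p.requires_grad_(False)                      # only the translator is trained
    translator = torch.nn.Linear(d_model, d_model)
    opt = torch.optim.Adam(translator.parameters(), lr=lr)

    for step, batch in zip(range(steps), dataloader):
        with torch.no_grad():
            out = model(batch["input_ids"], output_hidden_states=True)
            h = out.hidden_states[layer_idx]         # (batch, seq, d_model)
            target = F.softmax(out.logits, dim=-1)   # model's actual predictions
        # Assumed attribute names: model.ln_f (final norm), model.lm_head (unembedding)
        lens_logits = model.lm_head(model.ln_f(translator(h)))
        loss = F.kl_div(F.log_softmax(lens_logits, dim=-1), target,
                        reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return translator
```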

Findings and Implications

  • Efficacy Across Models: Most interpretability techniques tested showed considerable effectiveness when applied to RNNs; steering model responses and eliciting latent predictions were particularly successful.
  • State Steering: The paper introduces 'state steering' as a novel variant of CAA for RNNs, exploiting the models' compressed state for more effective behavior control (a hedged sketch follows this list).
  • Tuned Lens Perplexity: The research demonstrates that the tuned lens reveals a systematic decrease in perplexity across layers for both RNN architectures, mirroring findings with transformers.
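
A hedged sketch of the state-steering idea: rather than adding a steering vector to per-token activations, the contrastive difference is computed over, and added to, the RNN's fixed-size recurrent state. The `read_states` helper is hypothetical; how the state is actually read and written depends on the specific Mamba or RWKV implementation.

```python
import torch

@torch.no_grad()
def state_steering_delta(model, tokenizer, pos_prompt, neg_prompt, read_states):
    """Contrastive state difference for 'state steering': run a positive and a
    negative prompt and subtract the recurrent states the RNN ends up in.

    `read_states(model, input_ids)` is a hypothetical helper returning a dict
    {layer_index: state_tensor} holding each layer's recurrent state."""
    def states(text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        return read_states(model, ids)
    pos, neg = states(pos_prompt), states(neg_prompt)
    return {layer: pos[layer] - neg[layer] for layer in pos}

def apply_state_steering(states, delta, scale=1.0):
    """Nudge the compressed recurrent state before continuing generation."""
    return {layer: states[layer] + scale * delta[layer] for layer in states}
```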

These results not only confirm the possibility of extending transformer interpretability methods to RNNs but also open avenues for further optimization leveraging RNNs' unique structural characteristics.

Future Directions

The paper acknowledges the potential for deeper exploration of RNN states to improve interpretability and suggests extending this research to other model families. It also recommends applying additional interpretability tools, especially mechanistic or circuit-based approaches, to broaden understanding of model behavior.

Conclusion

In conclusion, the work by Paulo, Marshall, and Belrose makes a significant contribution to the field of AI by demonstrating that transformer-based interpretability methods can, to a large extent, be applied to RNN architectures. This research not only enhances our understanding of RNN behavior but also broadens the toolkit available for interpreting and steering the outputs of diverse neural network models. The practical and theoretical implications of this research underscore the importance of continued exploration in the field of AI interpretability.
