Does Transformer Interpretability Transfer to RNNs? (2404.05971v1)
Abstract: Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine whether selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and on eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.
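As a concrete illustration of the first technique, here is a minimal sketch of contrastive activation addition (CAA): a steering vector is computed as the difference of mean residual-stream activations between two contrastive prompt sets, and is then added to a layer's output at inference time. This assumes a HuggingFace-style causal LM that can return hidden states; the checkpoint name, prompt sets, and helper names (`caa_vector`, `steer`) are illustrative and not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint (the paper studies Mamba and RWKV models); any
# causal LM that returns hidden_states under output_hidden_states=True works.
MODEL_NAME = "state-spaces/mamba-2.8b-hf"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def mean_last_token_activation(prompts, layer):
    """Average the residual-stream activation at the final token position."""
    acts = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

def caa_vector(pos_prompts, neg_prompts, layer):
    """Steering vector: difference of means between contrastive prompt sets."""
    return (mean_last_token_activation(pos_prompts, layer)
            - mean_last_token_activation(neg_prompts, layer))

def steer(layer_module, vector, scale=1.0):
    """Register a forward hook that adds the steering vector to a block's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return ((hidden,) + tuple(output[1:])) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)
```

After obtaining a handle such as `steer(model.backbone.layers[20], vec, scale=4.0)` (the `.backbone.layers` path is an assumption about the model class), generation can be run as usual and the hook removed with `handle.remove()`. For RNNs, the same difference-of-means idea can in principle be applied to the compressed recurrent state rather than the residual stream, which is the kind of state-based variant the abstract alludes to.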
- Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
- Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
- Guy E. Blelloch. Prefix sums and their applications. Technical report, Carnegie Mellon University, 1990.
- Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
- Eliciting latent knowledge: How to tell if your eyes deceive you. Technical report, Alignment Research Center, December 2021. URL https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit.
- Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023.
- BTLM-3B-8K: 7B parameter performance in a 3B parameter model. arXiv preprint arXiv:2309.11568, 2023.
- R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936. doi: https://doi.org/10.1111/j.1469-1809.1936.tb02137.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1936.tb02137.x.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Patchscopes: A unifying framework for inspecting hidden representations of language models. arXiv preprint arXiv:2401.06102, 2024.
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Residual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.
- Eliciting latent knowledge from quirky language models. arXiv preprint arXiv:2312.01037, 2023.
- The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
- Parallelizing linear recurrent neural nets over sequence length. arXiv preprint arXiv:1709.04057, 2017.
- nostalgebraist. interpreting GPT: the logit lens. LessWrong, 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
- OpenAI. GPT-4 technical report, 2023.
- Future Lens: Anticipating subsequent tokens from a single hidden state. arXiv preprint arXiv:2311.04897, 2023.
- RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
- Bo Peng et al. Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024.
- Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022.
- StripedHyena: Moving beyond transformers with hybrid signal processing models. December 2023. URL https://github.com/togethercomputer/stripedhyena.
- Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
- Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593, 2022.
- Accelerating training of transformer-based language models with progressive layer dropping. arXiv preprint arXiv:2010.13369, 2020.
- Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.
Authors: Gonçalo Paulo, Thomas Marshall, Nora Belrose