Tracr: Compiled Transformers as a Laboratory for Interpretability (2301.05062v5)

Published 12 Jan 2023 in cs.LG, cs.AI, and stat.ML

Abstract: We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.

Citations (58)

Summary

  • The paper introduces Tracr, a compiler that transforms RASP programs into transformer weights to create neural models with predetermined computational structures.
  • The methodology traces a RASP program into a computational graph, infers the values each node can take, and assembles the components into concrete transformer weights; compression experiments show the compiled models can be shrunk while retaining performance.
  • The work establishes a benchmark for interpretability research, enabling precise testing of techniques like classifier probes and causal tracing in transformer architectures.

Overview of "Tracr: Compiled Transformers as a Laboratory for Interpretability"

The paper introduces Tracr, a compiler designed to transform high-level, human-readable programs into the weights of decoder-only transformer models, with the aim of advancing interpretability research. Tracr takes code written in the "Restricted Access Sequence Processing Language" (RASP) and converts it into a neural network with a known computational structure. This known structure provides ground truth for the study of interpretability methods.

Motivation and Methodology

Interpreting the inner workings of LLMs is challenging due to the absence of ground-truth explanations. Addressing this issue, Tracr constructs transformer models where every computational step is predetermined by the RASP program. By designing experiments on these pre-compiled models, researchers can explore various phenomena, such as the superposition of features in transformers, and evaluate different interpretability techniques.

Tracr translates RASP programs into transformer weights through several steps:

  1. Computational Graph Construction: Tracing the program to define a computational graph.
  2. Value Inference: Determining the possible outputs for each node in this graph.
  3. Component Translation: Independently converting each node into corresponding MLP and attention blocks.
  4. Layer Assignment: Allocating components to transformer layers.
  5. Model Construction: Building the transformer model from the blocks.
  6. Weight Assembly: Finalizing the translation into concrete model weights.
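
To make these steps concrete, the following sketch compiles a small RASP program end to end using the open-source Tracr API. It follows the usage shown in the Tracr repository (exact signatures may differ across versions): the program computes, at each position, the fraction of tokens so far that equal "x".

```python
from tracr.rasp import rasp
from tracr.compiler import compiling

def make_frac_prevs(bools: rasp.SOp) -> rasp.SOp:
  """Fraction of positions so far (inclusive) where `bools` is True."""
  bools = rasp.numerical(bools)
  # Select every position up to and including the current one.
  prevs = rasp.Select(rasp.indices, rasp.indices, rasp.Comparison.LEQ)
  # Averaging boolean values over the selected positions gives a fraction.
  return rasp.numerical(rasp.Aggregate(prevs, bools, default=0))

program = make_frac_prevs(rasp.tokens == "x")

# Steps 1-6 above all happen inside this one call: the program is traced
# into a graph, node values are inferred, nodes become attention/MLP
# blocks, blocks are assigned to layers, and weights are assembled.
model = compiling.compile_rasp_to_model(
    program,
    vocab={"w", "x", "y", "z"},
    max_seq_len=5,
    compiler_bos="BOS",
)

out = model.apply(["BOS", "x", "y", "x"])
print(out.decoded)  # e.g. ["BOS", 1.0, 0.5, 0.66...]
```

The resulting `model` is an ordinary decoder-only transformer whose weights implement the program exactly, which is what makes it usable as ground truth for interpretability work.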

The compiler allocates components to layers so as to minimize the depth of the resulting model, keeping compiled models compact while preserving interpretability.

Key Numerical Findings

Experiments using Tracr show that compiled models can be significantly compressed while retaining performance, illustrating how superposition arises in models that execute multi-step algorithms. For instance, a compiled model that computes the fraction of previous tokens matching a specific token performed nearly identically after its residual stream was compressed from 14 dimensions to 6. The compression retains the crucial information while representing less important features in superposition.
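
As a toy illustration of this compression setup (all names below are hypothetical stand-ins, not the paper's code): the idea is to learn a single linear map W that embeds the compiled model's D-dimensional residual stream into d < D dimensions and reads it back with the transpose, training W so the compressed model reproduces the original model's outputs. A minimal JAX sketch:

```python
import jax
import jax.numpy as jnp

D, d = 14, 6  # residual-stream width: compiled vs. compressed

# Stand-in for the frozen compiled model's forward pass.
key = jax.random.PRNGKey(0)
A = jax.random.normal(key, (D, D)) / jnp.sqrt(D)
frozen_model = lambda x: jax.nn.relu(x @ A)

def compressed_forward(W, x):
  # Project D -> d with W, reconstruct with W.T, run the frozen model.
  return frozen_model(x @ W @ W.T)

def loss(W, x):
  # Train W so the compressed model matches the original outputs.
  return jnp.mean((compressed_forward(W, x) - frozen_model(x)) ** 2)

W = jax.random.normal(key, (D, d)) / jnp.sqrt(D)
x = jax.random.normal(key, (32, D))  # batch of residual-stream inputs
W = W - 0.1 * jax.grad(loss)(W, x)   # one gradient step, for illustration
```

Because d < D, the map cannot give every feature its own direction; features that matter less for the output end up sharing directions, which is exactly the superposition the paper studies.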

Implications for Interpretability Research

Tracr contributes to interpretability research by providing a tool to create benchmark test cases, which is crucial for evaluating interpretability methods such as classifier probes and causal tracing. Because the compiled models have fully traceable internal logic, researchers can study the functionality of complex circuits with precision.
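
For instance, a classifier probe can be scored against a compiled model, because which residual-stream dimension encodes which variable is known by construction. A minimal sketch with synthetic stand-ins for the activations and the known variable (all names hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for residual-stream activations: [examples, residual dims].
activations = rng.normal(size=(200, 14))
# In a compiled model we know, by construction, where a variable lives;
# here we pretend dimension 3 encodes a boolean RASP variable.
true_variable = (activations[:, 3] > 0).astype(int)

probe = LogisticRegression().fit(activations, true_variable)
print("probe accuracy:", probe.score(activations, true_variable))
```

With ground truth available, the probe's accuracy measures the interpretability method itself rather than our guesses about what the model computes.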

Further, these compiled models offer educational insights into the algorithmic capabilities of transformers, illustrating how specific computations can be achieved within transformer architectures.

Limitations and Future Directions

Tracr, built on RASP, has limitations in expressivity, particularly with probabilistic computation and numerical attention patterns. Addressing these constraints could involve extending RASP's capabilities, thereby broadening the range of algorithms that can be implemented.

Future research may focus on refining the compression techniques to produce models that better simulate learned neural networks, potentially by introducing more sophisticated optimization or matrix factorization approaches.

Lastly, while current results validate the utility of compiled models for interpretability, further experimentation is needed to confirm their applicability across varied transformer applications, especially as transformer architectures continue to evolve.

In conclusion, Tracr introduces a novel approach to producing interpretable transformer models, setting the stage for more rigorous evaluation of interpretability techniques and offering new avenues for AI research that bridges theory with empirical practice.
