- The paper shows that Edge Pruning efficiently identifies sparse circuits in transformer models by framing circuit discovery as an optimization problem.
- Experimental results on tasks like IOI and gendered pronoun identification demonstrate significant edge reduction while maintaining model output fidelity.
- Edge Pruning scales to large datasets and multi-billion parameter models, offering a practical tool for deepening transformer interpretability in AI systems.
Finding Transformer Circuits with Edge Pruning
The paper "Finding Transformer Circuits with Edge Pruning" by Adithya Bhaskar et al. introduces a novel approach to interpretability in LLMs, specifically Transformers, by proposing a method named Edge Pruning to discover circuits. Circuits in this context are sparse computational subgraphs within a model that encapsulate specific behaviors of the model. The contribution of this paper lies in framing circuit discovery as an optimization problem and advancing a scalable solution that enhances both the efficiency and faithfulness of identified circuits.
Methodology
Circuit discovery has traditionally relied on either inefficient greedy search algorithms or gradient-based approximations that trade accuracy for computational speed. Edge Pruning diverges from these methods in what it prunes: rather than removing neurons or whole model components, it prunes the connections, or edges, between components. The method performs gradient descent on continuous edge masks that gauge the importance of each edge, then binarizes the masks to decide which edges remain in the circuit.
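As a rough illustration, the sketch below shows the core loop of learning continuous edge masks by gradient descent and binarizing them afterwards. It is a minimal stand-in, not the authors' implementation: the paper optimizes an L0 objective via a hard-concrete relaxation, which is simplified here to a sigmoid mask with an additive sparsity penalty, and the real task loss is replaced by a dummy quadratic so the snippet runs on its own.

```python
import torch

# Minimal sketch of edge-mask learning (illustrative, not the paper's code).
# Each candidate edge gets a learnable logit; a sigmoid yields a soft mask
# in (0, 1) during optimization, and a hard threshold binarizes it afterwards.
num_edges = 32                              # hypothetical edge count
logits = torch.zeros(num_edges, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
target = torch.rand(num_edges)              # stand-in for the real objective

for step in range(200):
    masks = torch.sigmoid(logits)           # continuous edge masks
    # In Edge Pruning, the task loss compares the masked model's outputs
    # to the full model's; a quadratic stand-in keeps this runnable.
    task_loss = ((masks - target) ** 2).mean()
    sparsity_penalty = 0.05 * masks.mean()  # L0-style pressure toward zero
    loss = task_loss + sparsity_penalty
    opt.zero_grad()
    loss.backward()
    opt.step()

# Round to {0, 1}: an edge belongs to the circuit iff its mask exceeds 0.5.
circuit_edges = torch.sigmoid(logits) > 0.5
print(f"{int(circuit_edges.sum())} of {num_edges} edges kept")
```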
Key to Edge Pruning's innovation is replacing the traditional residual stream in Transformers, where activations are summed into a single stream, with a disentangled residual stream that maintains a list of all prior activations. Each downstream component then reads a mask-weighted combination of these activations, so the edge masks can be optimized to match the model's behavior on a task, with L0 regularization pushing the masks toward a sparse circuit.
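To make the disentangled stream concrete, here is a hedged sketch (tensor names and shapes are illustrative): rather than adding each component's output into one running sum, all prior outputs are stored, and each downstream component consumes a mask-weighted sum over them, so every edge can be kept or pruned independently.

```python
import torch

# Sketch of a disentangled residual stream (shapes and names illustrative).
d_model = 16
prior_outputs = [
    torch.randn(d_model),   # e.g. the token embedding
    torch.randn(d_model),   # e.g. output of an earlier attention head
    torch.randn(d_model),   # e.g. output of an earlier MLP block
]

# One learnable mask per incoming edge of the current component.
edge_logits = torch.zeros(len(prior_outputs), requires_grad=True)
edge_masks = torch.sigmoid(edge_logits)

# The component reads a mask-weighted sum over all prior activations,
# instead of a single pre-summed residual stream.
component_input = sum(m * h for m, h in zip(edge_masks, prior_outputs))
```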
Experimental Validation
The efficacy of Edge Pruning is demonstrated through a series of experiments on multiple tasks using the GPT-2 Small model. Notably, Edge Pruning is benchmarked against prior methods such as ACDC and Edge Attribution Patching (EAP) on tasks including Indirect Object Identification (IOI), Greater Than (GT), and Gendered Pronoun (GP). The paper measures the faithfulness of circuits, that is, how accurately they reproduce the full model's outputs, using KL divergence and task-specific performance metrics.
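For concreteness, a minimal sketch of the KL-based faithfulness check follows. The logits here are random placeholders; in practice they would come from running the full model and the pruned circuit on the same inputs (the vocabulary size matches GPT-2's).

```python
import torch
import torch.nn.functional as F

# Sketch of a KL-divergence faithfulness metric (placeholder tensors).
full_logits = torch.randn(4, 50257)      # full model, 4 positions
circuit_logits = torch.randn(4, 50257)   # pruned circuit, same batch

# KL(full || circuit), averaged over the batch; lower = more faithful.
kl = F.kl_div(
    F.log_softmax(circuit_logits, dim=-1),  # input: circuit log-probs
    F.log_softmax(full_logits, dim=-1),     # target: full-model log-probs
    log_target=True,
    reduction="batchmean",
)
print(f"faithfulness (KL): {kl.item():.4f}")
```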
The results show that Edge Pruning matches and often surpasses the fidelity and performance of previous methods, especially on more complex settings such as multi-template IOI and GT. For instance, circuits identified on the IOI task using Edge Pruning had fewer than half as many edges as those found by prior methods while remaining equally faithful to the full model's predictions.
The method's scalability is another highlight. Edge Pruning efficiently handles datasets with up to 100,000 examples, a scale at which previous methods like ACDC become computationally prohibitive. Furthermore, Edge Pruning can be applied to multi-billion parameter models such as CodeLlama-13B, over 100 times the scale of models that prior circuit discovery methods operate on. The case study on CodeLlama-13B shows that circuits with over 99.96% sparsity still closely mirror the full model's performance, underscoring the method's scalability.
Theoretical and Practical Implications
The implications of these findings are significant for both theoretical research and practical applications in AI model interpretability. Theoretically, Edge Pruning pushes the boundaries of understanding transformer-based models by facilitating the study of model behaviors at a finer granularity. Practically, it equips researchers with a scalable tool for interpreting models at multiple levels of granularity without compromising efficiency or faithfulness. This can be particularly valuable in large-scale applications where understanding model decisions is critical for deploying AI systems securely and responsibly.
Furthermore, the successful application to models like CodeLlama-13B opens avenues for deeper investigations into mechanisms behind sophisticated AI behaviors, such as instruction prompting and in-context learning. This can foster the development of more nuanced interpretation frameworks that extend beyond existing state-of-the-art tools.
Future Directions
While Edge Pruning represents a significant advancement, the paper recognizes several limitations and areas for future enhancement. Combining Edge Pruning with fast, approximate methods like EAP could strike a balance between computational efficiency and interpretability performance. Additionally, despite the high faithfulness observed, the circuits identified by Edge Pruning may still miss backup components essential for specific tasks, an issue that more advanced faithfulness metrics could address.
Finally, automating the interpretation of the identified circuits remains a critical challenge. Future work could leverage automated interpretability techniques to provide more accessible insights into these complex subgraphs, further democratizing the understanding of intricate AI models.
Conclusion
In conclusion, the paper introduces Edge Pruning as a robust method for circuit discovery in transformer models. This method not only outperforms existing techniques in terms of faithfulness and sparsity but also demonstrates remarkable scalability to large datasets and models. As a significant contribution to the field of AI interpretability, Edge Pruning holds promise for furthering our understanding of transformer models and aiding their responsible deployment in diverse applications.