Introduction
Transformer-based models have drastically advanced performance across a wide range of AI tasks. Despite this success, how these models arrive at their predictions often remains opaque, which impedes both further improvement and trustworthiness. Current interpretability approaches struggle with the increasingly complex structures underlying these models, leaving open pressing questions about which parameters matter and where knowledge is actually located within the network's architecture.
Unveiling the Mysteries of Transformers
The key to understanding transformers is the residual stream, the pathway along which the outputs of the model's layers are added together and accumulate. By examining the residual stream, the paper identifies the mechanism connecting these outputs: each layer's output is added directly to the hidden state, and this sum determines the probabilities assigned to prediction outcomes. A token's probability rises when its before-softmax value (its logit) is large relative to those of the other tokens.
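As a rough illustration of this additive view, the sketch below builds a residual stream from randomly initialized layer outputs and projects it to token probabilities. The dimensions, weights, and names such as W_U and layer_outputs are purely illustrative and not taken from the paper; the final layer norm is omitted for simplicity.

```python
import torch

# Illustrative residual-stream arithmetic: hypothetical dimensions, random weights.
d_model, vocab_size, num_layers = 768, 50257, 12
torch.manual_seed(0)

x0 = torch.randn(d_model)                        # embedding at the predicting position
layer_outputs = [torch.randn(d_model) * 0.1      # what each attention / FFN block adds
                 for _ in range(2 * num_layers)]

# The residual stream accumulates these outputs by direct addition.
final_hidden = x0 + torch.stack(layer_outputs).sum(dim=0)

# The unembedding matrix maps the final hidden state to before-softmax logits.
W_U = torch.randn(vocab_size, d_model) * 0.02
logits = W_U @ final_hidden
probs = torch.softmax(logits, dim=-1)

# Because the map to logits is linear, every block's output contributes an additive
# share to each token's logit; a larger logit relative to the rest means a higher
# probability for that token.
token_id = int(torch.argmax(logits))
per_block_share = [float(W_U[token_id] @ out) for out in layer_outputs]
print(float(probs[token_id]), per_block_share[:3])
```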
Assigning Contributions and Probing Layers
To pinpoint influential parameters, the paper introduces log probability increase as a metric for quantifying a layer's contribution to a prediction. Using this metric, it traces how each layer, whether attention or feed-forward network (FFN), supports the predicted word. It further analyzes inner products between layer outputs to show how preceding layers influence the FFN layers that follow them.
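A minimal sketch of the log probability increase metric is shown below, using GPT-2 from Hugging Face Transformers. The prompt and target token are illustrative, and because output_hidden_states exposes hidden states only at block granularity, this sketch approximates the paper's per-sublayer (attention versus FFN) scores at the level of whole blocks.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The capital of France is"
ids = tok(prompt, return_tensors="pt").input_ids
target = tok(" Paris", add_special_tokens=False)["input_ids"][0]  # id of the predicted token

with torch.no_grad():
    out = model(ids, output_hidden_states=True)
hidden = out.hidden_states  # embeddings plus the output of each transformer block

def log_prob(h):
    # Decode an intermediate hidden state through the final layer norm and unembedding,
    # as if the model stopped at that point in the residual stream.
    logits = model.lm_head(model.transformer.ln_f(h))
    return torch.log_softmax(logits, dim=-1)[target].item()

prev = log_prob(hidden[0][0, -1])
for layer, h in enumerate(hidden[1:], start=1):
    cur = log_prob(h[0, -1])
    print(f"block {layer:2d}: log-probability increase {cur - prev:+.4f}")
    prev = cur
```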
Empirical Findings and Methodological Innovations
Empirical analyses on a collection of sampled cases indicate that every layer in a transformer plays a role in next-word prediction, with knowledge distributed across both attention and FFN layers. Notably, no single layer or module monopolizes importance; several contribute jointly to each prediction. Case studies reinforce these findings, showing that the features most important for a prediction can reside in both attention subvalues and FFN subvalues. Finally, the paper contributes a method for tracing how preceding layers influence the upper FFN layers, building on the inner-product analysis described above; a sketch follows.
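The following sketch illustrates one plausible form of that inner-product analysis: since the input to an FFN layer is a sum of the preceding layers' outputs, the pre-activation of each FFN subvalue decomposes into per-layer inner products. The weights, dimensions, and names such as W_in are hypothetical and chosen only to make the idea concrete, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the inner-product view (random weights; names are illustrative).
d_model, d_ffn = 768, 3072
torch.manual_seed(0)

W_in = torch.randn(d_ffn, d_model) * 0.02  # FFN "keys" (first linear layer); the second
                                           # linear layer would supply the subvalues themselves.

# Outputs of all attention / FFN blocks that precede this FFN in the residual stream.
preceding_outputs = [torch.randn(d_model) * 0.1 for _ in range(10)]
stream_input = torch.stack(preceding_outputs).sum(dim=0)

# Each FFN subvalue k is scaled by an activation of the inner product <W_in[k], stream_input>.
coeffs = F.gelu(W_in @ stream_input)
k = int(torch.argmax(coeffs))              # the most strongly activated subvalue

# Because the stream input is a sum of preceding outputs, the pre-activation splits into
# per-layer inner products, indicating which earlier layers activate this subvalue.
per_layer_inner = [float(W_in[k] @ out) for out in preceding_outputs]
print(k, [round(v, 3) for v in per_layer_inner])
```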
Roadmap to Interpretability
The authors state that the code will be released on GitHub, enabling others to apply these interpretability methods directly. Through such transparency, the inner workings of transformer-based models should become substantially easier to inspect and trust.