Quantifying Attention Flow in Transformers
The paper "Quantifying Attention Flow in Transformers" by Samira Abnar and Willem Zuidema introduces methods for tracing how attention propagates through Transformer layers. The authors argue that raw attention weights are unreliable indicators of information flow and propose two alternatives, attention rollout and attention flow, which give a clearer picture of how much each input token contributes to model decisions at deeper layers.
Problem Context
Traditional interpretations of self-attention in Transformers often equate attention weights with explanations for model decisions. However, as embeddings pass through successive layers they mix information from across the sequence, so an attention weight in a higher layer no longer identifies a specific input token. This paper addresses the resulting inadequacy of raw attention weights by introducing methods that explicitly track how token identities mix across layers.
Proposed Methods: Attention Rollout and Attention Flow
The authors propose two post hoc computations that combine the per-layer attention weights to approximate how much each input token contributes to a position in a higher layer:
- Attention Rollout: This method treats the embedding at each layer as a linear combination of token identities and rolls the attention weights out across layers, adjusting for residual connections, to map how information propagates from input tokens to intermediate embeddings (a minimal sketch follows this list).
- Attention Flow: This method interprets the attention graph as a Directed Acyclic Graph (DAG) whose edge capacities are the attention weights and runs a maximum-flow algorithm to compute how much information can reach a hidden embedding from each input token, providing a different lens on information distribution (see the second sketch below).
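The rollout computation reduces to repeated matrix multiplication. Below is a minimal sketch, assuming per-layer attention matrices that have already been averaged over heads; the function name and tensor layout are illustrative, not the authors' reference code.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll attention out across layers.

    `attentions` is assumed to be a list of per-layer matrices of shape
    (seq_len, seq_len), already averaged over heads. Residual connections
    are approximated by mixing in the identity: A_hat = 0.5 * A + 0.5 * I.
    """
    seq_len = attentions[0].shape[0]
    rollout = np.eye(seq_len)
    for layer_attention in attentions:
        # Account for the residual connection, then renormalize each row.
        a_hat = 0.5 * layer_attention + 0.5 * np.eye(seq_len)
        a_hat = a_hat / a_hat.sum(axis=-1, keepdims=True)
        # Compose with the attention accumulated from the layers below.
        rollout = a_hat @ rollout
    # rollout[i, j]: share of input token j in the embedding at position i
    # of the top layer.
    return rollout
```

Because each adjusted matrix is row-stochastic, every row of the result sums to one and can be read directly as a distribution over input tokens.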
Both methods fold residual connections into the attention matrices, yielding a refined view of attention that correlates more strongly with input-importance measures obtained from ablation methods and input gradients.
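Attention flow can be sketched with an off-the-shelf max-flow routine. The snippet below builds the layered attention graph with networkx; it assumes the same head-averaged matrices and residual adjustment as above, and is an illustration of the idea rather than the paper's implementation.

```python
import numpy as np
import networkx as nx

def attention_flow_scores(attentions, target_position):
    """Maximum-flow scores from every input token to one top-layer position.

    `attentions` is a list of (seq_len, seq_len) matrices averaged over heads;
    residual connections are folded in as 0.5 * A + 0.5 * I, as above. Nodes
    are (layer, position) pairs; edge capacities are the attention weights.
    """
    seq_len = attentions[0].shape[0]
    n_layers = len(attentions)
    graph = nx.DiGraph()
    for layer, layer_attention in enumerate(attentions):
        a_hat = 0.5 * layer_attention + 0.5 * np.eye(seq_len)
        a_hat = a_hat / a_hat.sum(axis=-1, keepdims=True)
        for i in range(seq_len):       # position in layer `layer + 1` ...
            for j in range(seq_len):   # ... attending to a position in layer `layer`
                # Direct the edge upward, from the lower layer to the higher
                # layer, so flow runs from input tokens towards the output.
                graph.add_edge((layer, j), (layer + 1, i),
                               capacity=float(a_hat[i, j]))
    sink = (n_layers, target_position)
    return np.array([nx.maximum_flow_value(graph, (0, j), sink)
                     for j in range(seq_len)])
```

Because edge capacities are shared across paths, the resulting scores tend to be less peaked than rollout, consistent with the more amortized picture described below.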
Empirical Analysis
The authors evaluate their methods on a verb number prediction task. A key finding is that raw attention weights correlate poorly with input-importance scores, and the correlation drops off sharply after the initial layers. In contrast, attention rollout and attention flow correlate substantially better: rollout tends to concentrate importance on a few tokens, whereas attention flow spreads (amortizes) importance across a set of relevant tokens. In both cases, the methods align far more closely with input gradients and ablation measures than raw attention does.
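To make this comparison concrete, one simple way to score agreement with a reference importance measure is a per-example rank correlation. The helper below is a hypothetical stand-in for the paper's evaluation, assuming 1-D importance vectors over the input tokens of a single example.

```python
import numpy as np
from scipy.stats import spearmanr

def importance_agreement(candidate_scores, reference_scores):
    """Spearman rank correlation between two importance vectors for one input.

    `candidate_scores` could be raw attention, rollout, or flow scores over
    the input tokens; `reference_scores` a gradient- or ablation-based
    saliency vector of the same length.
    """
    rho, _ = spearmanr(np.asarray(candidate_scores),
                       np.asarray(reference_scores))
    return rho
```

Averaging this quantity over a test set gives a single number per method, making it easy to compare raw attention, rollout, and flow against the same reference.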
Implications and Future Directions
The paper advances the interpretability of Transformer architectures by offering more reliable diagnostic tools than raw attention weights. This work is particularly relevant in tasks where understanding token importance is critical. In practical applications, these methodologies could enhance debugging and visualization of model decisions, offering a more nuanced explanation of how models derive predictions.
Future developments could expand on these methods by integrating enhanced attention metrics, such as effective attention weights, and exploring gradient-based adjustments. Additionally, while this paper centers on Transformer encoders, further research might adapt these methods for Transformer decoders, accounting for challenges posed by causal masking.
In summary, Abnar and Zuidema's research provides valuable methodologies for probing the inner workings of Transformer models, offering a richer understanding of attention dynamics that could inform both theoretical exploration and practical applications in advanced sequence modeling.