Quantifying Attention Flow in Transformers (2005.00928v2)

Published 2 May 2020 in cs.LG, cs.AI, and cs.CL

Abstract: In the Transformer model, "self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer. Thus, across layers of the Transformer, information originating from different tokens gets increasingly mixed. This makes attention weights unreliable as explanations probes. In this paper, we consider the problem of quantifying this flow of information through self-attention. We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow, as post hoc methods when we use attention weights as the relative relevance of the input tokens. We show that these methods give complementary views on the flow of information, and compared to raw attention, both yield higher correlations with importance scores of input tokens obtained using an ablation method and input gradients.

Authors (2)
  1. Samira Abnar (19 papers)
  2. Willem Zuidema (32 papers)
Citations (690)

Summary

Quantifying Attention Flow in Transformers

The paper "Quantifying Attention Flow in Transformers" by Samira Abnar and Willem Zuidema introduces novel methodologies aimed at addressing challenges in interpreting the flow of attention in Transformer models. The authors focus on the limitations of attention weights as direct indicators of information flow and propose two alternative approaches: attention rollout and attention flow. These methods offer improved insights into the contribution of input tokens to model decisions across layers.

Problem Context

Traditional interpretations of self-attention in Transformers often equate attention weights with explanations for model decisions. However, as representations are propagated through deeper layers, they become increasingly context-dependent, complicating token identifiability. This paper addresses the inadequacies of raw attention weights in capturing the nuanced flow of information by introducing methods that consider the mixing of token identities across layers.

Proposed Methods: Attention Rollout and Attention Flow

The authors propose two computational approaches to better approximate attention to input tokens using the attention weights:

  1. Attention Rollout: This method assumes that token identities are combined linearly across layers. It multiplies the attention matrices of successive layers, after adjusting each one for residual connections, to approximate how much of each input token propagates into the embeddings of higher layers (a minimal sketch follows this list).
  2. Attention Flow: This method treats the layered attention graph as a directed acyclic graph (DAG) in which attention weights act as edge capacities, and applies a maximum-flow algorithm to compute how much information each input token can send to a given hidden embedding, offering a different lens on information distribution (a max-flow sketch follows the next paragraph).
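
Below is a minimal sketch of attention rollout in NumPy. It assumes the per-layer attention matrices have already been averaged over heads; variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll attention out across layers.

    attentions: list of [seq_len, seq_len] matrices, one per layer,
                averaged over heads (an assumption of this sketch).
    Returns a matrix whose row i approximates the attention of position i
    in the top layer to each input token.
    """
    seq_len = attentions[0].shape[0]
    rollout = np.eye(seq_len)
    for attention in attentions:
        # Account for the residual connection by mixing in the identity,
        # then re-normalise each row so it still sums to one.
        a_hat = 0.5 * attention + 0.5 * np.eye(seq_len)
        a_hat = a_hat / a_hat.sum(axis=-1, keepdims=True)
        # Compose this layer's mixing with what earlier layers already did.
        rollout = a_hat @ rollout
    return rollout
```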

Both methods account for residual connections by augmenting each layer's attention matrix with the identity (roughly A_hat = 0.5·A + 0.5·I, row-normalized), and both correlate more strongly than raw attention with input importance measures obtained via ablation and input gradients.
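
As an illustration of the flow-network view, the following hedged sketch builds the layered attention graph with networkx and computes the maximum flow from each input token to one position in the final layer. It reuses the residual adjustment from the rollout sketch above; the graph construction and node naming are assumptions of this sketch, not the authors' implementation.

```python
import networkx as nx
import numpy as np

def attention_flow(attentions, target_position):
    """Maximum flow from each input token to `target_position` in the top layer."""
    num_layers = len(attentions)
    seq_len = attentions[0].shape[0]
    graph = nx.DiGraph()
    for layer, attention in enumerate(attentions):
        a_hat = 0.5 * attention + 0.5 * np.eye(seq_len)
        a_hat = a_hat / a_hat.sum(axis=-1, keepdims=True)
        for i in range(seq_len):      # position in layer `layer + 1`
            for j in range(seq_len):  # position in layer `layer`
                # The attention weight from i to j caps how much information
                # can pass from node (layer, j) up to node (layer + 1, i).
                graph.add_edge((layer, j), (layer + 1, i),
                               capacity=float(a_hat[i, j]))
    flows = np.zeros(seq_len)
    for token in range(seq_len):
        flow_value, _ = nx.maximum_flow(
            graph, (0, token), (num_layers, target_position))
        flows[token] = flow_value
    return flows  # flows[j]: flow that input token j can send to the target
```

Because the flow through any node is bounded by its edge capacities, attention flow tends to spread importance more evenly across tokens than rollout, which is consistent with the amortized view noted in the evaluation below.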

Empirical Analysis

The authors evaluate their methods on a verb number prediction task, where the model must decide whether a verb should be singular or plural given its preceding context. A key finding is that raw attention weights correlate poorly with input importance scores, and the correlation drops further beyond the initial layers. In contrast, attention rollout and attention flow show substantially higher correlations: rollout gives a more focused view, while attention flow spreads (amortizes) importance across several relevant tokens. In both cases, the proposed methods align better with input gradients and ablation-based importance than raw attention does.
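
For context on the importance baselines, the sketch below shows one common way to obtain gradient-based token importance scores, which can then be rank-correlated with rollout or flow scores. The model interface is hypothetical and not the paper's experimental code.

```python
import torch

def input_gradient_importance(model, embeddings, class_index):
    """Token importance as the gradient norm of the target logit w.r.t. each
    input embedding. `model` is assumed to map a [seq_len, dim] embedding
    matrix to a vector of class logits (hypothetical interface).
    """
    embeddings = embeddings.clone().detach().requires_grad_(True)
    logits = model(embeddings)
    logits[class_index].backward()
    return embeddings.grad.norm(dim=-1)  # one score per input token
```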

Implications and Future Directions

The paper advances the interpretability of Transformer architectures by offering more reliable diagnostic tools than raw attention weights. This work is particularly relevant in tasks where understanding token importance is critical. In practical applications, these methodologies could enhance debugging and visualization of model decisions, offering a more nuanced explanation of how models derive predictions.

Future developments could expand on these methods by integrating enhanced attention metrics, such as effective attention weights, and exploring gradient-based adjustments. Additionally, while this paper centers on Transformer encoders, further research might adapt these methods for Transformer decoders, accounting for challenges posed by causal masking.

In summary, Abnar and Zuidema's research provides valuable methodologies for probing the inner workings of Transformer models, offering a richer understanding of attention dynamics that could inform both theoretical exploration and practical applications in advanced sequence modeling.
