Attention is Not Not Explanation: An Analytical Perspective
Introduction
In "Attention is not not Explanation," Sarah Wiegreffe and Yuval Pinter address the controversial claim that attention mechanisms in Recurrent Neural Networks (RNNs) do not offer meaningful explanations for model predictions. Their rebuttal responds directly to "Attention is not Explanation" by Jain and Wallace (2019), critically examining its assumptions and experimental setup. Wiegreffe and Pinter propose alternative methods for evaluating the interpretability of attention mechanisms, providing a rigorous analysis that contributes to the ongoing discussion of explainability in NLP.
Background and Motivation
The core of the controversy lies in whether attention weights can serve as explanations for model predictions. Jain and Wallace assert that if alternative attention distributions yield similar predictions, the original attention scores cannot be reliably used to explain the model's decision. This premise rests on two assumptions: that a faithful explanation should be consistent with other feature-importance measures, and that it should be exclusive, i.e., no alternative attention distribution should support the same prediction equally well. Wiegreffe and Pinter argue that the definition of explanation is more nuanced and context-dependent than these assumptions allow.
Methodological Contributions
The paper offers four alternative tests for evaluating attention as explanation:
- Uniform Weights Baseline: A baseline where attention weights are frozen to a uniform distribution.
- Variance Calibration with Random Seeds: Assessing the expected variance in attention weights by training multiple models with different random seeds.
- Diagnostic Framework: Utilizing frozen weights from pretrained models in a non-contextual Multi-Layer Perceptron (MLP) architecture.
- Adversarial Training Protocol: An end-to-end adversarial training protocol that modifies the loss function to consider the distance from the base model's attention scores.
Experimental Analysis and Results
Uniform Weights Baseline
The authors first test whether learned attention is necessary at all by comparing models whose attention weights are frozen to a uniform distribution against models with learned attention. They find that for datasets such as AG News and 20 Newsgroups, learned attention offers little to no improvement over the uniform baseline, indicating that these datasets are not well suited for probing the role of attention in explainability.
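As a rough sketch of how such a baseline can be set up, here is a minimal PyTorch example (not the authors' code; the class name `AttnClassifier` and all hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttnClassifier(nn.Module):
    """LSTM encoder + additive attention + linear classifier.

    With uniform_attention=True the attention weights are frozen to 1/T
    (a uniform distribution over the T tokens) instead of being learned.
    """
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, n_classes=2,
                 uniform_attention=False):
        super().__init__()
        self.uniform_attention = uniform_attention
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.attn_scorer = nn.Linear(hid_dim, 1)        # additive attention scorer
        self.classifier = nn.Linear(hid_dim, n_classes)

    def forward(self, tokens):                          # tokens: (B, T)
        h, _ = self.encoder(self.embed(tokens))         # h: (B, T, hid_dim)
        if self.uniform_attention:
            T = h.size(1)
            alpha = torch.full(h.shape[:2], 1.0 / T, device=h.device)
        else:
            alpha = torch.softmax(self.attn_scorer(h).squeeze(-1), dim=-1)
        context = torch.einsum("bt,bth->bh", alpha, h)  # attention-weighted sum
        return self.classifier(context), alpha
```

Comparing the test performance of the two variants on each dataset reproduces the spirit of this first test: if the frozen-uniform model matches the learned-attention model, attention is doing little work on that dataset.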
Variance with Random Seeds
To assess the normal variance in attention distributions, the authors train multiple models with different random seeds. They show that some datasets, like SST, exhibit robust attention distributions despite random variations, while others, like Diabetes, show significant variability. This highlights the need to consider background stochastic variation when evaluating adversarial results.
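A minimal sketch of this seed-variance measurement, assuming the illustrative `AttnClassifier` above plus user-supplied `model_factory` and `train_fn` helpers (all names here are hypothetical):

```python
import torch

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two attention distributions of shape (T,)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: torch.sum(a * (torch.log(a + eps) - torch.log(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def seed_variance(model_factory, train_fn, instances, seeds=(0, 1, 2, 3, 4)):
    """Train one model per random seed and report the mean pairwise JSD of
    their attention distributions over held-out instances."""
    attn_by_seed = []
    for seed in seeds:
        torch.manual_seed(seed)
        model = train_fn(model_factory())                # returns a trained model
        with torch.no_grad():
            attn_by_seed.append(
                [model(x.unsqueeze(0))[1].squeeze(0) for x in instances])
    jsds = []
    for i in range(len(seeds)):
        for j in range(i + 1, len(seeds)):
            for a, b in zip(attn_by_seed[i], attn_by_seed[j]):
                jsds.append(jensen_shannon(a, b))
    return torch.stack(jsds).mean()
```

The resulting per-dataset number serves as the background level of variation against which any adversarially induced divergence should be judged.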
Diagnostic Framework
The authors introduce an MLP model guided by pre-trained attention distributions. The results show that attention scores taken from the original LSTM models are useful and consistent: imposing them improves the MLP's performance relative to a distribution the MLP learns on its own. This supports the notion that attention mechanisms capture meaningful token importance that transcends a specific model architecture.
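A sketch of this diagnostic setup, under the same illustrative assumptions as above (the class name `GuidedMLP` and its dimensions are hypothetical):

```python
import torch
import torch.nn as nn

class GuidedMLP(nn.Module):
    """Non-contextual diagnostic model: each token is embedded and transformed
    independently by an MLP, then pooled with externally supplied (frozen)
    attention weights instead of weights learned by this model."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.token_mlp = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.Tanh())
        self.classifier = nn.Linear(hid_dim, n_classes)

    def forward(self, tokens, frozen_alpha):
        # tokens: (B, T); frozen_alpha: (B, T) attention from a pre-trained LSTM
        h = self.token_mlp(self.embed(tokens))               # (B, T, hid_dim)
        context = torch.einsum("bt,bth->bh", frozen_alpha, h)
        return self.classifier(context)
```

If this guided MLP outperforms the same MLP using its own learned or uniform weights, the LSTM's attention scores encode token importance that does not depend on the LSTM's contextual representations.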
Adversarial Training Protocol
The authors propose a coherent end-to-end training protocol for adversarial attention distributions that accounts for both prediction similarity and attention-score divergence. Their findings confirm that although adversarial distributions can be found, they perform poorly when used to guide simpler models, indicating that trained attention mechanisms do capture essential information about the data.
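A per-instance version of such an objective might look like the sketch below; the choice of KL as the attention-divergence term and the weight `lam` are assumptions of this sketch, not necessarily the paper's exact formulation:

```python
import torch

def adversarial_loss(y_adv, y_base, alpha_adv, alpha_base, lam=1.0, eps=1e-12):
    """Keep predictions close to the base model (small total variation distance)
    while pushing the attention distribution away from it (large divergence)."""
    # Total variation distance between the two models' output distributions
    tvd = 0.5 * torch.sum(torch.abs(y_adv - y_base), dim=-1)
    # KL(alpha_base || alpha_adv) as the attention-divergence term (assumption)
    kl = torch.sum(alpha_base * (torch.log(alpha_base + eps)
                                 - torch.log(alpha_adv + eps)), dim=-1)
    # Minimizing this trades off prediction fidelity against attention distance
    return (tvd - lam * kl).mean()
```

Training the adversary with this kind of objective, rather than searching for counterfactual weights post hoc, is what makes the protocol a fair test of whether genuinely different attention distributions can support the same predictions.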
Implications and Future Directions
The paper's findings have significant theoretical and practical implications for the use of attention mechanisms in NLP. The authors show that attention scores, despite their variability, provide useful insights into model behavior and token importance. These insights challenge the exclusivity requirement assumed by Jain and Wallace, suggesting that multiple valid explanations can coexist.
Future research directions include extending the analysis to other tasks and languages, incorporating human evaluations, and developing theoretical frameworks for estimating the usefulness of attention mechanisms based on dataset and model properties. Additionally, exploring the existence of multiple adversarial attention models can further elucidate the limits and potential of attention as an explanatory tool.
Conclusion
Wiegreffe and Pinter's paper offers a comprehensive critique and alternative evaluation methods for the role of attention in model explainability. By demonstrating that attention mechanisms can provide meaningful explanations under certain conditions, the authors contribute to a more nuanced understanding of explainability in NLP models. Their work serves as a valuable resource for researchers aiming to develop robust and interpretable AI systems.