On the Suitability of Attention and Saliency Methods for Model Explanation
The paper by Jasmijn Bastings and Katja Filippova titled "The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?" addresses the ongoing debate over using attention mechanisms as explanatory tools in neural network models, particularly in NLP. The authors question the growing tendency to treat attention weights as explanations and advocate a more deliberate focus on saliency methods.
Core Argument and Objective
The primary argument centers on the purpose of model explanations: determining which input features are pivotal to a model's prediction. The authors argue that while attention mechanisms, such as the one introduced by Bahdanau et al. (2015), assign weights to input tokens, those weights do not faithfully explain a model's decision-making process, particularly when the intended user is a model developer concerned with the fidelity of the explanations. Saliency methods, which are designed to attribute relevance to input features with respect to the output, are deemed more suitable for this task.
Examination of Attention
The debate on attention as explanation centers on whether attention weights can convey the influence of each input token on a model's prediction. Papers by Jain and Wallace (2019), Serrano and Smith (2019), and Wiegreffe and Pinter (2019) evaluate the faithfulness of attention weights as indicators of feature importance. These studies find that attention-based explanations often diverge from gradient-based importance measures and can be altered without changing model predictions, suggesting that attention weights are limited in capturing the causal relationship between inputs and outputs.
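To make this kind of faithfulness probe concrete, the sketch below compares the token ranking induced by attention weights with the ranking induced by a gradient-based importance score on a toy attention classifier. The model, vocabulary, and token IDs are illustrative assumptions, not material from the cited papers; only the general idea of correlating attention against gradient importance follows those studies.

```python
# Illustrative faithfulness probe: do attention weights rank tokens the same
# way as a gradient-based importance score? (Toy model; assumptions only.)
import torch
import torch.nn as nn
from scipy.stats import kendalltau

torch.manual_seed(0)


class ToyAttentionClassifier(nn.Module):
    """Embeds tokens, attends over them with a learned query, classifies."""

    def __init__(self, vocab_size=100, dim=16, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.query = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, num_classes)

    def forward(self, embedded):                      # (batch, seq, dim)
        attn = torch.softmax(self.query(embedded).squeeze(-1), dim=-1)
        context = (attn.unsqueeze(-1) * embedded).sum(dim=1)
        return self.out(context), attn


model = ToyAttentionClassifier().eval()
tokens = torch.tensor([[5, 17, 42, 8, 3]])            # one hypothetical sentence

# Attention weights for the input.
embedded = model.emb(tokens).detach().requires_grad_(True)
logits, attn = model(embedded)
target = logits.argmax(dim=-1).item()

# Gradient-x-input importance of each token for the predicted class.
logits[0, target].backward()
saliency = (embedded.grad * embedded).sum(dim=-1).squeeze(0).abs().detach()

# Rank agreement between the two explanations; Jain and Wallace (2019)
# report that such correlations are often weak for trained models.
attn_weights = attn.detach().squeeze(0)
tau, _ = kendalltau(attn_weights.numpy(), saliency.numpy())
print("attention  :", attn_weights)
print("saliency   :", saliency)
print("kendall tau:", tau)
```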
Saliency Methods
Bastings and Filippova review various saliency methods, such as gradient-based methods, Layer-Wise Relevance Propagation (LRP), and occlusion-based techniques. Each method is discussed for its potential to provide a more direct measure of input relevance:
- Gradient-Based Methods: These methods utilize derivatives to determine the sensitivity of the model's output to changes in input features. Notably, Integrated Gradients address the saturation problem that can arise with vanilla gradients.
- Layer-Wise Relevance Propagation: This approach redistributes relevance scores layer-by-layer from the output back to the input, offering explainability by highlighting the contribution of each input across the network.
- Occlusion-Based Techniques: These involve systematically occluding parts of the input and observing the change in the output, gauging feature importance from the resulting shift in the model's prediction (a minimal sketch of gradient-, integrated-gradients-, and occlusion-based attribution follows this list).
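The sketch below illustrates these three attribution strategies on a toy bag-of-embeddings classifier. The model, vocabulary, and token IDs are assumptions made for the example; the attribution logic follows the standard formulations (gradient x input, Integrated Gradients with a zero baseline, and occlusion by replacing a token with padding), not any specific implementation from the paper.

```python
# Minimal sketch of input-attribution (saliency) methods on a toy classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB_SIZE, EMB_DIM, NUM_CLASSES, PAD_ID = 100, 16, 2, 0


class ToyClassifier(nn.Module):
    """Embeds tokens, mean-pools, and classifies with a linear layer."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, EMB_DIM, padding_idx=PAD_ID)
        self.out = nn.Linear(EMB_DIM, NUM_CLASSES)

    def forward_from_embeddings(self, embedded):      # (batch, seq, dim)
        return self.out(embedded.mean(dim=1))          # (batch, classes)

    def forward(self, token_ids):                      # (batch, seq)
        return self.forward_from_embeddings(self.emb(token_ids))


model = ToyClassifier().eval()
tokens = torch.tensor([[5, 17, 42, 8]])                # one hypothetical sentence
target = 1                                             # class whose score we explain


def grad_x_input(model, tokens, target):
    """Gradient of the target logit w.r.t. each token embedding,
    multiplied elementwise by the embedding and summed over dimensions."""
    embedded = model.emb(tokens).detach().requires_grad_(True)
    logit = model.forward_from_embeddings(embedded)[0, target]
    logit.backward()
    return (embedded.grad * embedded).sum(dim=-1).squeeze(0)


def integrated_gradients(model, tokens, target, steps=50):
    """Average gradients along a straight path from a zero baseline to the
    actual embeddings, then scale by (input - baseline). This mitigates the
    saturation problem of vanilla gradients."""
    embedded = model.emb(tokens).detach()
    baseline = torch.zeros_like(embedded)
    total_grads = torch.zeros_like(embedded)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (embedded - baseline)).requires_grad_(True)
        logit = model.forward_from_embeddings(point)[0, target]
        total_grads += torch.autograd.grad(logit, point)[0]
    avg_grads = total_grads / steps
    return ((embedded - baseline) * avg_grads).sum(dim=-1).squeeze(0)


def occlusion(model, tokens, target):
    """Importance of token i = drop in the target logit when that token
    is replaced by the padding token."""
    with torch.no_grad():
        base = model(tokens)[0, target]
        scores = []
        for i in range(tokens.size(1)):
            occluded = tokens.clone()
            occluded[0, i] = PAD_ID
            scores.append(base - model(occluded)[0, target])
    return torch.stack(scores)


print("grad x input        :", grad_x_input(model, tokens, target))
print("integrated gradients:", integrated_gradients(model, tokens, target))
print("occlusion           :", occlusion(model, tokens, target))
```

Each function returns one relevance score per input token; in practice these scores would be computed for a trained model and visualized as a heatmap over the input text.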
Comparative Analysis
The authors make a compelling case for preferring saliency methods over attention for explanation purposes. Saliency methods are explicitly designed to account for how input features affect predictions, taking the network's entire computation path into account. In contrast, attention weights are computed at a single intermediate layer and do not necessarily reflect the cumulative reasoning of the full model architecture.
Implications and Future Directions
Shifting focus from attention to saliency methods has substantial implications for building more interpretable AI systems. By emphasizing the faithfulness of explanations, the research community can move towards systems that give more accurate and informative accounts of their predictions. This shift could also pave the way for techniques that combine multiple dimensions of interpretability, improving model transparency and trustworthiness across applications.
Conclusion
In summary, Bastings and Filippova advocate a recalibration of focus in interpretability research. While they acknowledge the utility of attention mechanisms in specific contexts, their examination reaffirms that saliency methods are better aligned with the transparency and faithfulness objectives of model developers. The paper calls for a nuanced appreciation of interpretability objectives and for clearly articulated goals in future research.