Towards Faithful Explanations for Text Classification with Robustness Improvement and Explanation Guided Training (2312.17591v1)

Published 29 Dec 2023 in cs.CL

Abstract: Feature attribution methods highlight the important input tokens as explanations to model predictions, which have been widely applied to deep neural networks towards trustworthy AI. However, recent works show that explanations provided by these methods face challenges of being faithful and robust. In this paper, we propose a method with Robustness improvement and Explanation Guided training towards more faithful EXplanations (REGEX) for text classification. First, we improve model robustness by input gradient regularization technique and virtual adversarial training. Secondly, we use salient ranking to mask noisy tokens and maximize the similarity between model attention and feature attribution, which can be seen as a self-training procedure without importing other external information. We conduct extensive experiments on six datasets with five attribution methods, and also evaluate the faithfulness in the out-of-domain setting. The results show that REGEX improves fidelity metrics of explanations in all settings and further achieves consistent gains based on two randomization tests. Moreover, we show that using highlight explanations produced by REGEX to train select-then-predict models results in comparable task performance to the end-to-end method.
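The abstract names two robustness ingredients: input gradient regularization and virtual adversarial training (VAT). Below is a minimal PyTorch sketch of what those two loss terms typically look like for a text classifier operating on input embeddings. It is not the authors' code; the function and argument names (`classify_from_embeds`, `lambda_grad`, `lambda_vat`, `epsilon`) are illustrative assumptions, and the VAT step follows the standard one-step power-iteration formulation of Miyato et al. rather than any REGEX-specific variant.

```python
# Illustrative sketch (not the paper's implementation) of two robustness terms
# mentioned in the abstract: input-gradient regularization and virtual
# adversarial training (VAT), applied to a classifier over token embeddings.
import torch
import torch.nn.functional as F

def robustness_loss(embeds, labels, classify_from_embeds,
                    lambda_grad=0.1, lambda_vat=1.0, epsilon=1e-2):
    """embeds: (batch, seq, dim) embeddings with requires_grad=True (assumed)."""
    logits = classify_from_embeds(embeds)
    task_loss = F.cross_entropy(logits, labels)

    # Input-gradient regularization: penalize the norm of d(loss)/d(embeddings)
    # so small input perturbations cannot change the loss sharply.
    grad = torch.autograd.grad(task_loss, embeds, create_graph=True)[0]
    grad_penalty = grad.pow(2).sum(dim=(1, 2)).mean()

    # VAT: estimate the perturbation direction that most changes the output
    # distribution (one power-iteration step), then penalize that change.
    with torch.no_grad():
        p = F.softmax(logits, dim=-1)
    d = torch.randn_like(embeds)
    d = epsilon * d / d.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(classify_from_embeds(embeds.detach() + d), dim=-1),
                  p, reduction="batchmean")
    d_grad = torch.autograd.grad(kl, d)[0]
    r_adv = epsilon * d_grad / d_grad.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    vat_loss = F.kl_div(F.log_softmax(classify_from_embeds(embeds + r_adv.detach()), dim=-1),
                        p, reduction="batchmean")

    return task_loss + lambda_grad * grad_penalty + lambda_vat * vat_loss
```

The sketch omits the explanation-guided part of REGEX (salience-based masking of noisy tokens and maximizing agreement between model attention and feature attributions), which the paper adds as further training signal on top of terms like these.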

Authors (4)
  1. Dongfang Li (46 papers)
  2. Baotian Hu (67 papers)
  3. Qingcai Chen (36 papers)
  4. Shan He (23 papers)
Citations (4)
