
An Investigation of Language Model Interpretability via Sentence Editing (2011.14039v2)

Published 28 Nov 2020 in cs.CL

Abstract: Pre-trained language models (PLMs) like BERT are being used for almost all language-related tasks, but interpreting their behavior remains a significant challenge and many important questions remain largely unanswered. In this work, we re-purpose a sentence editing dataset, where faithful, high-quality human rationales can be automatically extracted and compared with extracted model rationales, as a new testbed for interpretability. This enables us to conduct a systematic investigation of an array of questions regarding PLMs' interpretability, including the role of the pre-training procedure, a comparison of rationale extraction methods, and the behavior of different layers in the PLM. The investigation generates new insights; for example, contrary to the common understanding, we find that attention weights correlate well with human rationales and work better than gradient-based saliency for extracting model rationales. Both the dataset and code are available at https://github.com/samuelstevens/sentence-editing-interpretability to facilitate future interpretability research.
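
The following is a minimal sketch (not the authors' implementation; see the linked repository for that) of the two rationale-extraction approaches the abstract compares: attention-based scores versus gradient-based saliency over a BERT model. The checkpoint, the untrained classification head, the choice of last-layer attention averaged over heads and query positions, and the example sentence are all illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; the classifier head is randomly initialized here,
# so the scores are only meaningful after fine-tuning on a real task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

sentence = "The movie was surprisingly good despite the weak script."
inputs = tokenizer(sentence, return_tensors="pt")

# --- Attention-based rationale: how much attention each token receives ---
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
last_layer = out.attentions[-1][0]                 # (heads, seq, seq)
attn_scores = last_layer.mean(dim=0).mean(dim=0)   # avg over heads, then query positions

# --- Gradient-based saliency: gradient norm of the top logit w.r.t. embeddings ---
embeds = model.get_input_embeddings()(inputs["input_ids"])
embeds.retain_grad()                               # keep grads on a non-leaf tensor
logits = model(inputs_embeds=embeds,
               attention_mask=inputs["attention_mask"]).logits
logits[0, logits[0].argmax()].backward()
sal_scores = embeds.grad[0].norm(dim=-1)           # one score per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, a, s in zip(tokens, attn_scores, sal_scores):
    print(f"{tok:>12}  attn={a.item():.3f}  saliency={s.item():.3f}")
```

Under this setup, the top-k tokens by either score can be taken as the extracted model rationale and compared against the human rationale derived from the edits, which is the kind of comparison the paper systematically investigates.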

Authors (2)
  1. Samuel Stevens (17 papers)
  2. Yu Su (138 papers)
Citations (6)
