
A Fine-grained Interpretability Evaluation Benchmark for Neural NLP (2205.11097v2)

Published 23 May 2022 in cs.CL

Abstract: While there is increasing concern about the interpretability of neural models, the evaluation of interpretability remains an open problem, due to the lack of proper evaluation datasets and metrics. In this paper, we present a novel benchmark to evaluate the interpretability of both neural models and saliency methods. This benchmark covers three representative NLP tasks: sentiment analysis, textual similarity and reading comprehension, each provided with both English and Chinese annotated data. In order to precisely evaluate the interpretability, we provide token-level rationales that are carefully annotated to be sufficient, compact and comprehensive. We also design a new metric, i.e., the consistency between the rationales before and after perturbations, to uniformly evaluate the interpretability on different types of tasks. Based on this benchmark, we conduct experiments on three typical models with three saliency methods, and unveil their strengths and weaknesses in terms of interpretability. We will release this benchmark at https://www.luge.ai/#/luge/task/taskDetail?taskId=15 and hope it can facilitate research in building trustworthy systems.
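The abstract describes the new metric only at a high level: consistency between token-level rationales extracted before and after an input perturbation. The paper's exact formulation is not reproduced here; the sketch below is a hypothetical illustration that assumes rationales are sets of token positions and measures their agreement with token-level F1 overlap.

```python
# Hypothetical sketch (not the paper's exact metric): treat a rationale as a
# set of token positions and score consistency as the F1 overlap between the
# rationale on the original input and the rationale on a perturbed input.

def rationale_f1(before: set, after: set) -> float:
    """Token-level F1 between two rationale index sets."""
    if not before and not after:
        return 1.0
    if not before or not after:
        return 0.0
    overlap = len(before & after)
    precision = overlap / len(after)
    recall = overlap / len(before)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: rationale token positions from a saliency method on the original
# sentence vs. on a perturbed (e.g., synonym-substituted) sentence.
original_rationale = {2, 3, 7}
perturbed_rationale = {2, 3, 8}
print(rationale_f1(original_rationale, perturbed_rationale))  # ~0.667
```

Averaging such a score over many perturbed examples would give a single consistency number per model/saliency-method pair, which is one plausible way a task-agnostic interpretability comparison could be carried out.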

Authors (10)
  1. Lijie Wang (23 papers)
  2. Yaozong Shen (2 papers)
  3. Shuyuan Peng (2 papers)
  4. Shuai Zhang (319 papers)
  5. Xinyan Xiao (41 papers)
  6. Hao Liu (497 papers)
  7. Hongxuan Tang (8 papers)
  8. Ying Chen (333 papers)
  9. Hua Wu (191 papers)
  10. Haifeng Wang (194 papers)
Citations (20)
