EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models (2505.01238v1)

Published 2 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: As NLP models continue to evolve and become integral to high-stakes applications, ensuring their interpretability remains a critical challenge. Given the growing variety of explainability methods and diverse stakeholder requirements, frameworks that help stakeholders select appropriate explanations tailored to their specific use cases are increasingly important. To address this need, we introduce EvalxNLP, a Python framework for benchmarking state-of-the-art feature attribution methods for transformer-based NLP models. EvalxNLP integrates eight widely recognized explainability techniques from the Explainable AI (XAI) literature, enabling users to generate and evaluate explanations based on key properties such as faithfulness, plausibility, and complexity. Our framework also provides interactive, LLM-based textual explanations, facilitating user understanding of the generated explanations and evaluation outcomes. Human evaluation results indicate high user satisfaction with EvalxNLP, suggesting it is a promising framework for benchmarking explanation methods across diverse user groups. By offering a user-friendly and extensible platform, EvalxNLP aims at democratizing explainability tools and supporting the systematic comparison and advancement of XAI techniques in NLP.

Summary

  • The paper introduces EvalxNLP, a framework supporting eight gradient-based and perturbation-based feature attribution methods for evaluating NLP model explanations.
  • EvalxNLP evaluates explainability methods based on faithfulness, plausibility, and complexity using established metrics and can generate natural language summaries via an LLM.
  • A case study shows that no single method dominates across all metrics, emphasizing that the choice of explainability technique should be guided by the evaluation criteria most relevant to the use case.

EvalxNLP: An Evaluation Framework for NLP Explainability

The paper "EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models" introduces the EvalxNLP framework, which addresses the increasingly critical need for effective evaluation of post-hoc explainability methods applied to NLP models. This is particularly relevant given the opacity of transformer-based models in high-stakes domains such as healthcare and finance, where model interpretability is crucial for trust and accountability.

EvalxNLP supports eight feature attribution methods, both gradient-based and perturbation-based, providing a thorough basis for evaluating and benchmarking explanations of transformer models. Gradient-based approaches such as Integrated Gradients, Saliency, and DeepLIFT are implemented using Captum, while perturbation-based methods like LIME and SHAP are integrated directly or via existing libraries. These methods are assessed on three pivotal properties: faithfulness, plausibility, and complexity. Faithfulness metrics include soft sufficiency, soft comprehensiveness, FAD N-AUC, and AUC-TP, which measure how accurately explanations reflect model behavior. Plausibility is evaluated through the IOU-F1 score and AUPRC, while complexity is measured via Shannon entropy and sparsity metrics.
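The paper builds on existing attribution libraries rather than re-implementing them. The following is a minimal sketch of the underlying Captum pattern for one gradient-based method (Layer Integrated Gradients applied to a Hugging Face sentiment classifier); it illustrates the general recipe rather than EvalxNLP's own API, and the checkpoint name and target label index are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

# Example public sentiment checkpoint; any sequence-classification model works.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def forward_fn(input_ids, attention_mask):
    # Return class logits; Captum selects the column given by `target`.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

text = "A thoroughly enjoyable film with a clever, heartfelt script."
enc = tokenizer(text, return_tensors="pt")

# Attribute the positive-class logit to the embedding layer, interpolating
# between a padding baseline and the actual input.
lig = LayerIntegratedGradients(forward_fn, model.distilbert.embeddings)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=torch.full_like(enc["input_ids"], tokenizer.pad_token_id),
    additional_forward_args=(enc["attention_mask"],),
    target=1,      # assumed index of the "positive" label in this checkpoint
    n_steps=32,
)

# Collapse the embedding dimension to one importance score per token.
scores = attributions.sum(dim=-1).squeeze(0)
scores = scores / scores.abs().max()   # normalize for readability
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
for tok, s in zip(tokens, scores.tolist()):
    print(f"{tok:>15s}  {s:+.3f}")
```

Perturbation-based methods such as LIME and SHAP also yield one score per token, so the same evaluation metrics apply to their output. The complexity metrics in particular operate on the attribution vector alone; below is a minimal sketch of an entropy-based complexity score and a simple sparsity measure (the exact definitions used in EvalxNLP may differ).

```python
import numpy as np

def complexity_entropy(attr):
    """Shannon entropy of the normalized absolute attributions.

    Lower entropy means the explanation concentrates its mass on a few
    tokens and is easier to read; uniform scores maximize it.
    """
    p = np.abs(np.asarray(attr, dtype=float))
    p = p / p.sum()
    p = p[p > 0]                      # ignore zero-mass tokens in the log
    return float(-(p * np.log(p)).sum())

def sparsity(attr, threshold=0.05):
    """Fraction of tokens whose normalized importance stays below `threshold`.

    Higher values indicate a sparser, more selective explanation.
    """
    p = np.abs(np.asarray(attr, dtype=float))
    p = p / p.max()
    return float((p < threshold).mean())

# Toy attribution vector: most of the signal sits on two tokens.
scores = [0.02, 0.01, 0.85, 0.03, 0.70, 0.02, 0.01]
print(f"entropy  = {complexity_entropy(scores):.3f}")
print(f"sparsity = {sparsity(scores):.2f}")
```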

The framework further enhances comprehensibility by generating natural language explanations through an LLM-based module. This addresses the difficulty lay users face in interpreting raw importance scores by providing textual summaries of feature attributions and evaluation metrics. Its utility is supported by the human evaluation, in which users reported high satisfaction with the framework's usability and interpretability, particularly those with less NLP experience.
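The paper does not publish the exact prompt used by this module, so the snippet below only sketches the general idea: token-level attributions and metric values are serialized into a natural-language prompt that any chat-oriented LLM client could consume. The prompt wording, field names, and helper function are assumptions, and the actual LLM call is omitted.

```python
def build_explanation_prompt(text, tokens, scores, metrics):
    """Serialize attributions and metric values into an LLM prompt.

    The structure is illustrative; EvalxNLP's actual prompt may differ.
    """
    ranked = sorted(zip(tokens, scores), key=lambda ts: abs(ts[1]), reverse=True)
    top = ", ".join(f"{tok} ({score:+.2f})" for tok, score in ranked[:5])
    metric_lines = "\n".join(f"- {name}: {value:.3f}" for name, value in metrics.items())
    return (
        "You are helping a non-expert understand a model explanation.\n"
        f"Input text: {text}\n"
        f"Most influential tokens (importance scores): {top}\n"
        f"Evaluation metrics:\n{metric_lines}\n"
        "In plain language, summarize which words drove the prediction "
        "and what the metric values say about the explanation's quality."
    )

prompt = build_explanation_prompt(
    text="A thoroughly enjoyable film with a clever, heartfelt script.",
    tokens=["enjoyable", "clever", "heartfelt", "film", "script"],
    scores=[0.92, 0.61, 0.58, 0.12, 0.08],
    metrics={"soft_sufficiency": 0.81, "soft_comprehensiveness": 0.67, "entropy": 1.42},
)
print(prompt)
# The resulting string would then be sent to whichever LLM the user configures.
```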

EvalxNLP's case study on sentiment analysis with the Movie Reviews dataset exemplifies its capability to benchmark feature attribution methods on text classification tasks. The results showed that explanation methods differ in efficacy, with DeepLIFT achieving the strongest faithfulness and SHAP aligning best with human intuition, although no single method dominated across all metrics. This underscores the need to select explainability techniques tailored to the evaluation criteria most pertinent to the user's context.

Despite these accomplishments, limitations remain, most notably the framework's restriction to text classification tasks and feature attribution methods. Future developments could add support for a wider range of NLP tasks and incorporate non-feature-attribution explanation methods such as \cite{slalom-2025}, along with broader robustness metrics such as sensitivity \cite{sensitivity-infidelity-2019}.

EvalxNLP represents a significant contribution to the arsenal of tools available for XAI (Explainable AI) in NLP, democratizing access and encouraging systematic improvements in model interpretability. Its extensibility allows researchers and developers to continually refine and adapt it, ensuring its relevance in advancing transparent and trustworthy AI systems.