- The paper introduces Self-Explaining Neural Networks (SENNs), which build interpretability into model training so that explanations are explicit, faithful, and stable.
- The paper proposes an architecture with input-dependent generalized coefficients and concept-based features that keeps the model locally linear, and hence interpretable, around each input.
- The paper demonstrates competitive accuracy and markedly more robust explanations through empirical evaluations on datasets such as MNIST and COMPAS.
Towards Robust Interpretability with Self-Explaining Neural Networks
Introduction
The increasing prominence of machine learning in decision-critical domains such as healthcare and the legal system makes interpretable models a necessity. This paper by Alvarez-Melis and Jaakkola presents a framework for building interpretability directly into machine learning models: Self-Explaining Neural Networks (SENNs). Unlike post-hoc interpretability methods, these models incorporate interpretability intrinsically, during the training phase.
Desiderata for Interpretability
Three key desiderata set the foundation for this model framework:
- Explicitness - The ability of the model to provide clear and unambiguous explanations.
- Faithfulness - Ensuring that the explanations accurately represent the model's functioning.
- Stability - Similar inputs should yield similar explanations.
The authors argue that most existing interpretability frameworks fail to satisfy these criteria jointly, motivating a new class of models: SENNs.
Architecture and Regularization
The paper constructs SENNs systematically, starting from linear models and generalizing to non-linear ones while preserving the interpretability that makes linear models attractive. The fundamental constructs of SENNs are:
- Generalized Coefficients: The coefficients are functions of the input rather than constants, which keeps the model locally linear and therefore interpretable around each point.
- Feature Basis: SENNs use higher-level features, referred to as 'concepts', in place of raw input features. These concepts can range from pre-defined feature transformations to learned representations.
- General Aggregation: Instead of a simple sum, any aggregation function that preserves the interpretability of the coefficients can be used (a minimal sketch of the resulting architecture follows this list).
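Concretely, the model has the form f(x) = g(θ(x)₁h(x)₁, …, θ(x)ₖh(x)ₖ), which reduces to f(x) = θ(x)ᵀh(x) when the aggregator is a sum. The sketch below illustrates this structure in PyTorch; the class name, layer sizes, and encoder choices are assumptions made for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SelfExplainingNet(nn.Module):
    """Illustrative SENN-style model: f(x) = g(theta(x), h(x)).

    Layer sizes and encoders are placeholders; the paper's experiments
    use task-specific encoders (e.g. convolutional or autoencoder-based
    concept encoders for images).
    """

    def __init__(self, input_dim, n_concepts, n_classes):
        super().__init__()
        # h(x): maps the raw input to k higher-level 'concepts'
        self.concept_encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, n_concepts)
        )
        # theta(x): input-dependent coefficients, one per concept and class
        self.relevance_net = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, n_concepts * n_classes)
        )
        self.n_concepts, self.n_classes = n_concepts, n_classes

    def forward(self, x):
        h = self.concept_encoder(x)  # (batch, n_concepts)
        theta = self.relevance_net(x).view(-1, self.n_classes, self.n_concepts)
        # g: sum aggregation, so the model is linear in the concepts
        # once theta(x) is treated as locally constant
        logits = torch.einsum("bck,bk->bc", theta, h)
        return logits, theta, h
```

Each prediction then comes with an explanation "for free": the pairs of concept value h(x)ᵢ and relevance θ(x)ᵢ, read off exactly as one would read the coefficients of a linear model.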
The framework is completed by a gradient-based regularization term that enforces stability of the explanations, ensuring that small changes in the input do not drastically alter the coefficients, and hence the explanation.
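The regularizer penalizes the gap between the model's actual gradient and the gradient implied by treating θ(x) as locally constant, roughly ‖∇ₓf(x) − θ(x)ᵀJₓh(x)‖, added to the training loss with a weighting hyperparameter. The sketch below shows one way such a penalty could be computed with autograd for a single example and a single class score; the function name and this per-example simplification are mine, not the authors' code.

```python
import torch

def stability_penalty(model, x, class_idx):
    """Approximate || grad_x f(x) - theta(x)^T J_x h(x) || for one input.

    Assumes `model` returns (logits, theta, h) as in the earlier sketch;
    a real implementation would batch this and add it to the loss with
    a weighting hyperparameter.
    """
    x = x.detach().requires_grad_(True)
    logits, theta, h = model(x.unsqueeze(0))
    f = logits[0, class_idx]

    # How the chosen class score actually changes with the input.
    grad_f = torch.autograd.grad(f, x, create_graph=True)[0]

    # Jacobian of the concept encoder: one row per concept.
    jac_h = torch.autograd.functional.jacobian(
        lambda inp: model(inp.unsqueeze(0))[2][0], x, create_graph=True
    )  # shape: (n_concepts, input_dim)

    # Gradient implied by the locally linear view f(x) = theta(x)^T h(x).
    implied_grad = theta[0, class_idx] @ jac_h

    return (grad_f - implied_grad).norm()
```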
Empirical Evaluation
The empirical assessment focuses on four criteria: performance, explicitness, faithfulness, and stability. The experiments use datasets such as MNIST, UCI benchmarks, and the COMPAS recidivism dataset. Key findings include:
- Performance: SENNs perform competitively with their non-interpretable counterparts, showing minimal degradation in accuracy.
- Explicitness: The learned concepts and their relevance scores yield clear, understandable explanations, verified through visual inspection on MNIST.
- Faithfulness: Quantitative metrics show that the relevance scores produced by SENNs track the actual effect of the corresponding features on the prediction.
- Stability: SENN explanations are substantially more robust to small (including adversarially chosen) input perturbations than those of competing interpretability frameworks (a sketch of the stability measure follows this list).
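The stability comparison relies on a local Lipschitz-style estimate: how much the explanation θ(x) can change relative to how much the input changes within a small neighborhood. The sketch below is a simplified version of that idea; it uses random perturbations as a cheap stand-in for a worst-case search and assumes the model returns (logits, theta, h) as in the earlier sketches.

```python
import torch

def local_lipschitz_estimate(model, x, epsilon=0.1, n_samples=100):
    """Estimate max over x' near x of ||theta(x) - theta(x')|| / ||x - x'||.

    Random Gaussian perturbations of scale `epsilon` stand in for a proper
    worst-case search over a neighborhood of x; a lower value indicates a
    more stable explanation.
    """
    with torch.no_grad():
        _, theta_x, _ = model(x.unsqueeze(0))
        worst = 0.0
        for _ in range(n_samples):
            noise = epsilon * torch.randn_like(x)
            _, theta_xp, _ = model((x + noise).unsqueeze(0))
            ratio = (theta_x - theta_xp).norm() / (noise.norm() + 1e-12)
            worst = max(worst, ratio.item())
    return worst
```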
Implications and Future Directions
Practical Implications
- Transparency in Decision-Critical Domains: The inherent interpretability of SENNs is particularly beneficial for applications in healthcare and the criminal justice system, where understanding model predictions is crucial for trust and acceptance.
- Robustness and Reliability: The stability of the explanations means that end-users can rely on them remaining consistent even when the input changes slightly.
Theoretical Implications
- Foundation for Future Models: This work sets a precedent for integrating interpretability directly into the model architecture rather than relying on post-hoc methods.
- Guide for Interpretability Metrics: The desiderata proposed—explicitness, faithfulness, and stability—serve as a comprehensive framework for evaluating the interpretability of future models.
Speculation on Future Developments
- Extended Application Domains: Future work could apply the principles laid out in this paper to more complex domains such as speech recognition and natural language processing.
- Advanced Concept Learning Techniques: The development of more sophisticated methods for learning interpretable basis concepts could further enhance the clarity and usefulness of explanations.
- User Studies for Interpretability: Conducting user studies to understand how these models are perceived by end-users could provide deeper insights into the practical utility of SENNs.
In conclusion, this paper proposes a carefully structured framework for building interpretable models that retain high performance. By ensuring explanations are explicit, faithful, and stable, SENNs represent a significant step towards robust interpretability in machine learning. The implications of this work extend both to practical applications and to the future development of interpretable AI systems.