A Survey on Neural Network Interpretability
The paper "A Survey on Neural Network Interpretability" by Yu Zhang, Peter Tiňo, Aleš Leonardis, and Ke Tang presents a comprehensive review of neural network interpretability research. This field has gained momentum due to increasing concerns about the opaque nature of deep neural networks (DNNs) and their implications for trust, ethics, and applicability in sensitive domains like medicine and finance. The paper methodically demystifies the concept of interpretability, critically examines existing methodologies, and suggests future research directions under a novel taxonomy.
Definitions and Importance
The authors emphasize that interpretability is often ambiguously defined across different studies. Here, interpretability is operationally defined as the ability of a model to provide explanations in understandable terms to humans. Such explanations are critical for high-stakes scenarios requiring reliability, fairness, and compliance with legislative mandates.
Proposed Taxonomy
The authors introduce a three-dimensional taxonomy to categorize interpretability methods:
- Passive vs. Active Approaches—whether a method analyzes an already-trained network post hoc or modifies the network or its training process to build interpretability in.
- Type of Explanation—categorical types including logic rules, hidden semantics, attribution, and explanation by examples.
- Local vs. Global Interpretability—whether the explanation covers a single prediction (or a small region of the input space) or the model's behavior over the whole input space.
Passive Approaches
Passive methods dissect trained networks to extract insights:
- Rule Extraction: Decompositional approaches exploit the network's internals, such as weights and unit activations, while pedagogical methods treat the network as a black box and learn rule sets from its input-output behavior. The objective is usually global interpretability, though some recent methods target local explanations.
- Hidden Semantics: Visualization techniques and alignment with known concepts are employed to understand neuron roles. This is predominantly applied in computer vision.
- Attribution: Gradient-based and model-agnostic approaches quantify how much each input feature contributes to a prediction. Techniques range from saliency maps to Shapley values, and some offer both local and global views (a minimal gradient-saliency sketch follows this list).
- Example-based: Methods such as influence functions measure the impact of training instances on predictions, focusing on local interpretability.
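To make the attribution idea concrete, below is a minimal sketch of a passive, gradient-based saliency map in PyTorch. The toy classifier, input size, and target class are illustrative assumptions rather than details from the survey; the point is only that the explanation is read off the gradient of a class score with respect to the input of an already-trained model.

```python
import torch
import torch.nn as nn

# Hypothetical trained classifier standing in for any DNN under analysis.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

def gradient_saliency(model, x, target_class):
    """Return |d(score of target_class) / d(x)|: per-pixel sensitivity of the class score."""
    x = x.clone().requires_grad_(True)   # differentiate w.r.t. the input, not the weights
    score = model(x)[0, target_class]    # scalar logit of the class being explained
    score.backward()                     # fills x.grad with the gradient of that score
    return x.grad.abs().squeeze(0)       # gradient magnitude as a local importance map

# Usage on a dummy 28x28 input, explaining the score for class 3.
x = torch.rand(1, 1, 28, 28)
saliency = gradient_saliency(model, x, target_class=3)
print(saliency.shape)  # torch.Size([1, 28, 28])
```

Because only the trained model's gradients are inspected, nothing about the network or its training changes; this is what makes the method passive, and its explanation is local to the given input.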
Active Approaches
Active methods alter training processes to embed interpretability:
- Rule-based: Tree regularization adds a training penalty that favors models whose behavior can be closely mimicked by a compact decision tree, yielding global, rule-like insight.
- Hidden Semantics: Training objectives encourage individual filters or feature maps to align with distinct, human-recognizable concepts (disentanglement), making internal representations easier to interpret.
- Attribution and Prototypes: Models are optimized so that their feature attributions are transparent or their predictions are grounded in intuitive, learned prototypes, facilitating global understanding (see the gradient-penalty sketch after this list).
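As a contrast to the passive saliency map above, the following sketch shows one simple, assumed form of active attribution regularization: an L1 penalty on input gradients is added to the training loss so that the resulting saliency maps tend to be sparse and easier to read. The architecture, dummy data, and penalty weight lambda_attr are illustrative assumptions, not a specific method described in the survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lambda_attr = 0.1  # assumed trade-off between task accuracy and attribution sparsity

def training_step(x, y):
    x = x.requires_grad_(True)                   # needed to take gradients w.r.t. the input
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    # Input gradients of the summed logits; create_graph=True lets the penalty itself be optimized.
    input_grads, = torch.autograd.grad(logits.sum(), x, create_graph=True)
    attr_penalty = input_grads.abs().mean()      # L1-style pressure toward sparse attributions
    loss = task_loss + lambda_attr * attr_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a dummy batch of 32 single-channel 28x28 inputs and random labels.
x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))
print(training_step(x, y))
```

The defining feature, in the survey's terms, is that the interpretability pressure enters during training rather than being applied post hoc to a fixed model.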
Implications and Future Directions
The taxonomy exposes under-explored areas, most notably active interpretability methods. Incorporating domain knowledge into interpretability techniques could improve explanation quality, and refining evaluation criteria to include practical, human-centered metrics would better align research with real-world applications.
Conclusion
This paper lays the foundation for a structured examination of neural network interpretability, providing clarity and direction for future research. By systematically organizing the existing literature and suggesting paths for innovation, it significantly enhances our understanding of both the challenges and opportunities in making deep networks more transparent.