Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) (1711.11279v5)

Published 30 Nov 2017 in stat.ML

Abstract: The interpretation of deep learning models is a challenge due to their size, complexity, and often opaque internal state. In addition, many systems, such as image classifiers, operate on low-level features rather than high-level concepts. To address these challenges, we introduce Concept Activation Vectors (CAVs), which provide an interpretation of a neural net's internal state in terms of human-friendly concepts. The key idea is to view the high-dimensional internal state of a neural net as an aid, not an obstacle. We show how to use CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result--for example, how sensitive a prediction of "zebra" is to the presence of stripes. Using the domain of image classification as a testing ground, we describe how CAVs may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.

Authors (7)
  1. Been Kim (54 papers)
  2. Martin Wattenberg (39 papers)
  3. Justin Gilmer (39 papers)
  4. Carrie Cai (5 papers)
  5. James Wexler (15 papers)
  6. Rory Sayres (10 papers)
  7. Fernanda Viegas (6 papers)
Citations (1,656)

Summary

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Introduction

The paper "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)" addresses the challenge of interpreting deep learning models by proposing a method that moves beyond traditional feature attribution. The authors introduce Concept Activation Vectors (CAVs) and a new technique called Testing with CAVs (TCAV). This approach aims to quantify the sensitivity of a model's predictions to high-level human-understandable concepts. The framework is structured to provide more meaningful interpretability that is accessible, customizable, plug-in ready, and provides a global quantification of the model's behavior.

Concept Activation Vectors (CAVs) and TCAV Framework

CAVs represent the high-dimensional internal state of a neural network in terms of user-defined concepts. High-level human concepts typically align poorly with the low-level features (such as pixel values) on which models like image classifiers operate. The fundamental idea is to construct vectors in the network's internal activation space that correspond to high-level concepts defined by sets of example inputs. A concept is defined with a small dataset of examples, a linear classifier is trained to separate the activations produced by the concept's examples from those produced by random counterexamples, and the CAV is taken as the vector orthogonal to the resulting decision boundary.
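
A minimal sketch of this step, assuming activations from the chosen layer have already been collected into arrays `concept_acts` and `random_acts`; these names, and the use of scikit-learn's logistic regression as the linear classifier, are illustrative rather than taken from the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Fit a linear classifier separating concept activations from random
    activations and return the unit-norm vector orthogonal to its boundary."""
    X = np.concatenate([concept_acts, random_acts], axis=0)
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_.ravel()               # normal to the separating hyperplane
    return cav / np.linalg.norm(cav)      # unit vector pointing toward the concept
```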

Building on CAVs, the TCAV technique uses directional derivatives to determine how important a concept is to a model's classification. For instance, TCAV can measure how sensitive the model's "zebra" prediction is to the "striped" concept, by checking whether moving an input's activations in the direction of the striped CAV increases the zebra logit. This quantifies conceptual sensitivity and allows various hypotheses about the model's internal decision-making to be tested without retraining or modifying the model.
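
As a sketch of this sensitivity computation, assume the network has been split into two PyTorch modules, `bottom` (input to layer-l activations) and `top` (layer-l activations to class logits), and that `cav` is a tensor with the same number of elements as the activations; the split and the function name are assumptions made for illustration, not the paper's API:

```python
import torch

def conceptual_sensitivity(bottom, top, x, cav, class_idx):
    """Directional derivative of the class logit, taken at the layer-l
    activations of input x, in the direction of the CAV."""
    acts = bottom(x).detach().requires_grad_(True)   # layer-l activations of x
    logit = top(acts)[0, class_idx]                  # logit for the class of interest
    (grad,) = torch.autograd.grad(logit, acts)       # gradient w.r.t. the activations
    return torch.dot(grad.flatten(), cav.flatten()).item()
```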

Methodology

The methodology involves several stages:

  1. User-defined Concepts: High-level concepts are defined using sets of example images.
  2. Learning CAVs: CAVs are obtained by training a linear classifier to distinguish between activations from concept examples and random examples.
  3. Directional Derivatives: The conceptual sensitivity of a prediction is measured as the directional derivative of the class logit, taken at the input's layer activations, in the direction of the CAV.
  4. TCAV Scores: The TCAV score for a class is the fraction of that class's inputs whose conceptual sensitivity to the concept is positive (see the sketch after this list).
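
Under the same assumptions as the sketches above, and reusing the hypothetical `conceptual_sensitivity` helper, the score in step 4 might be computed roughly as follows, with `class_inputs` assumed to be an iterable of unbatched input tensors for the class of interest:

```python
def tcav_score(bottom, top, class_inputs, cav, class_idx):
    """Fraction of a class's examples whose conceptual sensitivity is positive."""
    sensitivities = [
        conceptual_sensitivity(bottom, top, x.unsqueeze(0), cav, class_idx)
        for x in class_inputs
    ]
    return sum(s > 0 for s in sensitivities) / len(sensitivities)
```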

Validation and Insights

The paper validates the method using both controlled experiments and real-world applications. One controlled experiment created datasets of images with captions written on them, varying the level of noise in the captions. The authors demonstrated that TCAV scores corresponded closely with the ground truth, clearly indicating which concept (image or caption) the model was relying on. This was contrasted with saliency map methods, which were shown to be less effective at conveying this information to human evaluators.

For real-world applications, TCAV was applied to well-known image classification networks (GoogleNet, Inception V3) and to medical image analysis for predicting diabetic retinopathy. In the former, TCAV revealed biases and provided insight into which concepts different layers of the network were learning. In the latter, TCAV highlighted diagnostic concepts relevant to domain experts, showcasing its potential applicability in medical fields.

Implications and Future Directions

The implications of TCAV are multifaceted. Practically, it offers a tool for debugging and gaining insights into machine learning models without extensive knowledge of ML internals, making it accessible for a wider range of users. The method's ability to quantify the importance of user-defined concepts makes it especially valuable for applications needing transparent and interpretable AI, such as healthcare, finance, and autonomous systems.

Theoretically, TCAV enhances our understanding of neural network behavior by mapping high-dimensional activations to human-interpretable concepts. This bridges the gap between complex machine learning models and their usability in real-world scenarios. The approach may also open avenues for developing more robust models that align with human intuition and values, potentially aiding in the mitigation of biases inherently present in training data.

Future research could expand TCAV to other data types beyond images, such as text, audio, and temporal sequences. Additionally, exploring automated ways to define meaningful concepts could further streamline the interpretability process. In the domain of adversarial robustness, TCAV could serve as a diagnostic tool to identify and understand vulnerabilities in model predictions.

Conclusion

The paper presents TCAV as a significant advancement in the interpretability of machine learning models by enabling a quantifiable measure of a model's sensitivity to human-defined concepts. By facilitating post hoc analysis and allowing hypothesis testing with minimal ML expertise, TCAV sets a foundation for more transparent and trustworthy AI systems. The method's broad applicability, from standard image classification to specialized medical diagnostics, underscores its potential as a versatile tool in the ongoing effort to make AI models more interpretable and aligned with human values.
