Introduction
Research in interpretable machine learning has directed significant attention toward Concept Bottleneck Models (CBMs), which allow humans to intervene at the level of high-level attributes, or concepts. This is particularly advantageous because users can directly influence model predictions by editing the predicted concept values. A critical obstacle for CBMs, however, is their need for concept knowledge and annotations at training time, which can be impractical or unattainable in many real-world scenarios.
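To make this intervention mechanism concrete, below is a minimal, hypothetical sketch of a CBM-style forward pass with a hard concept intervention, written in PyTorch-like Python; the class and argument names (ConceptBottleneckModel, interventions) are illustrative assumptions rather than an architecture from the paper.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Toy CBM: input -> predicted concepts -> label."""

    def __init__(self, concept_encoder: nn.Module, label_predictor: nn.Module):
        super().__init__()
        self.concept_encoder = concept_encoder  # maps inputs to concept logits
        self.label_predictor = label_predictor  # maps concept values to label logits

    def forward(self, x, interventions=None):
        # Predict concept probabilities from the input.
        c_hat = torch.sigmoid(self.concept_encoder(x))
        # Intervention: a human overwrites selected concepts with known values.
        if interventions is not None:
            c_hat = c_hat.clone()
            for idx, value in interventions.items():
                c_hat[:, idx] = value
        # The label depends on the input only through the (possibly edited) concepts,
        # so concept edits translate directly into changed predictions.
        return self.label_predictor(c_hat)

# Example: clamp the third concept to "present" for a batch of inputs.
model = ConceptBottleneckModel(nn.Linear(16, 4), nn.Linear(4, 2))
y_hat = model(torch.randn(8, 16), interventions={2: 1.0})
```

Because the label head sees only the concept vector, overwriting a concept propagates directly to the prediction; this is exactly the property that black-box networks lack by default.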
Beyond Concept Bottleneck Models
A recent contribution addresses this challenge by presenting a technique that enables concept-based interventions in non-interpretable, pre-trained neural networks, without requiring concept annotations during their initial training. The work is a notable advancement grounded in intervenability, a new measure that quantifies how amenable a model is to concept-based interventions and also serves as an effective objective for fine-tuning black-box models to respond better to such interventions. A key premise is that the original model's architecture and learned representations are preserved, which is critical for knowledge transfer and for maintaining performance across diverse tasks.
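The review does not reproduce the formal definition, but intervenability can be read schematically as the expected reduction in prediction loss when intermediate representations are edited to agree with intervened concepts. The notation below (black box f(x) = g(h(x)), probe q, intervened concepts c', edited representation z', trade-off weight lambda) is this summary's assumption, not necessarily the paper's exact formulation:

```latex
% Assumed notation: f(x) = g(h(x)) is the black box with intermediate
% representation z = h(x); q is a probe mapping z to concept predictions;
% c' denotes the intervened concept values; z' is an edited representation
% that the probe maps (approximately) to c'.
\iota(f) \;=\; \mathbb{E}\Big[\, \mathcal{L}\big(y,\, g(h(x))\big)
          \;-\; \mathcal{L}\big(y,\, g(z')\big) \,\Big],
\qquad
z' \;\approx\; \operatorname*{arg\,min}_{z}\;
      \lambda\, d\big(q(z),\, c'\big) \;+\; (1-\lambda)\,\big\lVert z - h(x) \big\rVert_2 .
```

Under this reading, a model is more intervenable the more its loss drops when its representations are nudged toward expert-provided concept values, which is the quantity the fine-tuning procedure is meant to increase.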
Methods and Contributions
The approach involves a three-step intervention procedure: first, training a probing function that maps intermediate representations to concept values; second, editing these representations so that they reflect the desired concept interventions; and third, recomputing the final model output from the edited representations. Notably, the procedure requires only a small annotated validation set for probing. Building on the formalized notion of intervenability, the authors then introduce a fine-tuning procedure that leaves the model's architecture unchanged, which makes the strategy applicable to diverse pre-trained neural networks.
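A minimal sketch of the three steps in PyTorch-like Python follows; the helper names (fit_probe, edit_representation, intervene), the linear probe, the gradient-based representation edit, and the lam weighting are assumptions made for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_probe(representations, concepts, steps=500, lr=1e-2):
    """Step 1: fit a linear probe q: z -> concepts on a small annotated set."""
    probe = nn.Linear(representations.shape[1], concepts.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.binary_cross_entropy_with_logits(probe(representations), concepts).backward()
        opt.step()
    return probe

def edit_representation(z, probe, c_prime, steps=100, lr=0.1, lam=0.5):
    """Step 2: edit z so the probe predicts the intervened concepts c_prime,
    while keeping the edited representation close to the original."""
    z_edit = z.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z_edit], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        concept_fit = F.binary_cross_entropy_with_logits(probe(z_edit), c_prime)
        proximity = (z_edit - z).pow(2).mean()
        (lam * concept_fit + (1 - lam) * proximity).backward()
        opt.step()
    return z_edit.detach()

def intervene(encoder, head, probe, x, c_prime):
    """Step 3: recompute the prediction from the edited representation."""
    z = encoder(x).detach()
    z_edit = edit_representation(z, probe, c_prime)
    return head(z_edit)
```

Since only the probe and the edit itself are optimized here, the black-box encoder and prediction head stay untouched, consistent with the paper's premise of preserving the original architecture and representations.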
The work also examines various fine-tuning paradigms and contrasts them with the proposed intervenability-driven method. These comparative studies support the validity of the new approach, demonstrating improved intervention effectiveness and model calibration relative to common-sense baselines.
Empirical Evaluation
Extensive experiments on both synthetic and real-world datasets, including chest X-ray classification, illustrate the practical implications of the proposed method. While CBMs demonstrate expected strength in scenarios where the data-generating process depends heavily on the concepts, the newly introduced fine-tuning strategy rivals or even surpasses CBMs in more complex setups, including cases where the concepts are not sufficient to fully capture the relationship between inputs and outputs.
Conclusion
This work represents a significant milestone in interpretable machine learning, offering a compelling solution for enhancing the intervention capacities of opaque neural network models. The methods developed extend the practicality of intervenability measures to real-world applications, offering a mechanism to mediate between interpretability and performance while allowing existing black-box models to benefit from human-expert interaction. The paper sets the stage for further exploration of optimal intervention strategies and the integration of automated concept discovery, as well as their implications for the evaluation and refinement of large pre-trained models.