Concept Activation Vectors (CAVs)
- Concept Activation Vectors (CAVs) are interpretable probes derived from neural network activations that isolate high-level human concepts using linear classifiers.
- CAVs are computed by training classifiers on positive and negative samples, with their influence quantified via directional derivatives and TCAV scores.
- Applied in image, medical, and sequential data domains, CAVs help diagnose model behavior and bias while requiring robust sampling and validation for stability.
Concept Activation Vectors (CAVs) constitute a class of interpretable probes that capture the presence and influence of human-aligned concepts in the internal activations of deep neural networks. CAVs are constructed as directions in latent activation space that correspond to user-defined high-level concepts (e.g., “striped,” “female,” “cardiomegaly”). When paired with techniques such as Testing with CAVs (TCAV), they enable direct quantification of the influence of these concepts on a model’s decisions, addressing a foundational challenge in the interpretability of complex neural architectures (Kim et al., 2017).
1. Fundamental Principles and Mathematical Formulation
At their core, CAVs are derived by identifying a “direction” in the activation space of a neural network layer that differentiates between instances containing and not containing a specific concept. The typical procedure is as follows:
- Given a trained network and a hidden layer $l$, extract the layer-$l$ activations $f_l(x)$ for two sets: $P_C$ (positive concept samples) and $N$ (negatives, often randomly chosen).
- Train a linear classifier (e.g., SVM, logistic regression) to discriminate between the activations of $P_C$ and $N$. The normal vector $v_C^l$ to the decision boundary is the CAV for concept $C$ at layer $l$.
- The influence of the concept on class $k$ for input $x$ is evaluated via the directional derivative:

$$S_{C,k,l}(x) = \nabla h_{l,k}\big(f_l(x)\big) \cdot v_C^l,$$

where $h_{l,k}$ denotes the logit for class $k$ as a function of the activations after layer $l$ (Kim et al., 2017).

For a set of class-$k$ samples $X_k$, the TCAV score for concept $C$ is defined as:

$$\mathrm{TCAV}_{C,k,l} = \frac{\left|\{x \in X_k : S_{C,k,l}(x) > 0\}\right|}{\left|X_k\right|}.$$

This value (range $[0,1]$) gives the fraction of class-$k$ inputs for which moving along the CAV direction increases the class-$k$ logit, establishing a quantitative global importance of the concept.
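A minimal Python sketch of this procedure follows, using synthetic activations and a toy differentiable logit head; the names (`acts_concept`, `w_k`, the tanh head) are illustrative assumptions, not part of the cited method. In a real setting, the activations come from the network and the per-example gradients from automatic differentiation of the class logit with respect to the layer-$l$ activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # dimensionality of the layer-l activations

# Synthetic stand-ins for extracted activations: concept examples vs. random negatives.
acts_concept = rng.normal(loc=0.5, size=(200, d))
acts_random = rng.normal(loc=0.0, size=(200, d))

# 1) Train a linear probe; the normal to its decision boundary is the CAV v_C^l.
X = np.vstack([acts_concept, acts_random])
y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_.ravel()
cav /= np.linalg.norm(cav)  # unit-norm concept direction

# 2) Directional derivatives S_{C,k,l}(x) = grad h_{l,k}(f_l(x)) . v_C^l.
#    Toy differentiable head h_{l,k}(a) = w_k . tanh(a), whose gradient is
#    w_k * (1 - tanh(a)^2); a real model would supply per-example gradients.
w_k = rng.normal(size=d)  # hypothetical class-k readout weights
acts_class_k = rng.normal(loc=0.2, size=(100, d))  # activations of class-k inputs
grads = w_k * (1.0 - np.tanh(acts_class_k) ** 2)
directional_derivs = grads @ cav

# 3) TCAV score: fraction of class-k inputs with a positive directional derivative.
tcav_score = float(np.mean(directional_derivs > 0))
print(f"TCAV score of the concept for class k: {tcav_score:.2f}")
```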
2. Construction and Robustness
CAVs are sensitive to the distributions of concept and non-concept samples. The classic approach is to resample negative sets and retrain the classifier multiple times, using a t-test (with Bonferroni correction for multiple comparisons) to confirm the statistical significance of the resulting CAV and its corresponding TCAV score (Kim et al., 2017). The variance of the estimated CAV is inversely proportional to the number of negative samples ($1/N$ scaling) (Wenkmann et al., 28 Sep 2025). This behavior motivates both multi-run averaging and careful resource allocation to ensure stable and reproducible explanation vectors.
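A hedged sketch of this significance check, with synthetic score distributions standing in for repeated TCAV runs; the helper `significance_test`, the Welch form of the t-test, and the concrete numbers are illustrative choices, not taken from the cited papers.

```python
import numpy as np
from scipy.stats import ttest_ind

def significance_test(concept_scores, random_scores, n_concepts_tested=1, alpha=0.05):
    """Two-sided t-test of concept TCAV scores vs. random-CAV scores,
    with a Bonferroni-corrected significance threshold."""
    _, p_value = ttest_ind(concept_scores, random_scores, equal_var=False)
    return p_value, bool(p_value < alpha / n_concepts_tested)

# Synthetic score distributions standing in for repeated TCAV runs.
rng = np.random.default_rng(1)
concept_scores = rng.normal(0.85, 0.05, size=30)  # e.g., 30 retrainings against fresh negatives
random_scores = rng.normal(0.50, 0.10, size=30)   # TCAV scores of random "concepts"
p, significant = significance_test(concept_scores, random_scores, n_concepts_tested=10)
print(f"p = {p:.2e}, significant after Bonferroni correction: {significant}")
```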
There is also a direct connection between CAVs and the mean difference between class-specific activations, especially when using centroids or Fisher discriminant analysis (Schnoor et al., 26 Sep 2025, Schmalwasser et al., 23 May 2025). Under Gaussian class-conditional activation assumptions, the CAV is proportional to the difference in means of the concept and non-concept activation clouds. Variants such as FastCAV leverage this by replacing SVMs with a mean-difference estimator, significantly reducing computational complexity while achieving nearly identical results (Schmalwasser et al., 23 May 2025).
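A sketch of this mean-difference construction on synthetic, roughly isotropic activation clouds, compared against a linear-probe CAV; this is an illustration of the idea rather than the exact FastCAV implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mean_difference_cav(acts_concept: np.ndarray, acts_negative: np.ndarray) -> np.ndarray:
    """CAV as the normalized difference of activation centroids."""
    direction = acts_concept.mean(axis=0) - acts_negative.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Synthetic, roughly isotropic activation clouds.
rng = np.random.default_rng(2)
pos = rng.normal(0.5, 1.0, size=(500, 32))
neg = rng.normal(0.0, 1.0, size=(500, 32))

fast_cav = mean_difference_cav(pos, neg)

# Compare with a linear-probe CAV; under isotropic class covariance the directions agree closely.
probe = LogisticRegression(max_iter=1000).fit(
    np.vstack([pos, neg]), np.concatenate([np.ones(500), np.zeros(500)])
)
probe_cav = probe.coef_.ravel() / np.linalg.norm(probe.coef_)
print("cosine(mean-difference CAV, probe CAV) =", float(fast_cav @ probe_cav))
```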
3. Applications and Case Studies
CAVs and TCAV have been widely applied in domains where interpretability and trust are crucial:
- Image Classification: TCAV on vision models (e.g., Inception V3, GoogleNet) can confirm intuitive dependencies (e.g., “striped” for “zebra,” “red” for “fire engine”) and reveal subtle, sometimes unintended correlations such as gender bias (e.g., “female” influencing “apron” predictions) (Kim et al., 2017). Relative CAVs allow pairwise concept comparisons (“black hair” vs. “brown hair”).
- Medical Imaging: CAVs have mapped dermatological features (e.g., pigment networks in skin lesions (Lucieri et al., 2020)) or expert radiological concepts (e.g., diabetic retinopathy features) onto classification models, providing quantitative scores that mirror human diagnostic criteria and facilitating the debugging of misclassifications (Kim et al., 2017, Lucieri et al., 2020, Maksudov et al., 4 Jun 2025).
- Sequence and Structured Data: Adaptation to EEG models demonstrates how concepts derived from externally labeled events and anatomical EEG features inform the interpretability of neural time-series representations (Gjølbye et al., 2023).
- Interactive and Counterfactual Explanations: Traversing the latent space of an autoencoder along CAV directions enables the generation of counterfactuals that amplify or suppress clinically significant features (such as cardiomegaly in chest X-rays) (Maksudov et al., 4 Jun 2025); a schematic sketch of such a traversal appears after this list.
- Diagnosis of Bias in Embedding Models: In music information retrieval (MIR), CAVs are used to reveal gender and language bias in music genre embeddings and offer vector-based post-hoc debiasing strategies (Gebhardt et al., 29 Sep 2025).
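As a schematic sketch of the counterfactual traversal mentioned above: shift a latent code along the concept direction and decode. The decoder, latent dimensionality, and CAV here are toy stand-ins; a real system would use a trained autoencoder and a validated concept vector estimated on latent codes.

```python
import numpy as np

def traverse_concept(z: np.ndarray, cav: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a latent code along the (unit-norm) concept direction by alpha."""
    return z + alpha * cav

# Toy stand-ins for a trained autoencoder's latent space (16-d codes) and decoder.
rng = np.random.default_rng(3)
decoder_weights = rng.normal(size=(16, 64))
decoder = lambda z: np.tanh(z @ decoder_weights)   # illustrative decoder only
z0 = rng.normal(size=16)                           # latent code of some input image
cav = rng.normal(size=16)
cav /= np.linalg.norm(cav)                         # concept direction in latent space

# Amplify (+alpha) or suppress (-alpha) the concept, then decode counterfactuals.
for alpha in (-2.0, 0.0, 2.0):
    x_cf = decoder(traverse_concept(z0, cav, alpha))
    print(f"alpha = {alpha:+.1f}, decoded output mean = {x_cf.mean():.3f}")
```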
4. Limitations, Challenges, and Recent Advancements
Several important caveats and methodological extensions have been identified:
- Dependence on Negative Set and Probe Data: The non-concept distribution crucially determines the CAV direction. Arbitrary or adversarial selection of negatives can severely degrade or manipulate explanations, making CAV-based analyses vulnerable to targeted attacks (Schnoor et al., 26 Sep 2025).
- Variance and Reproducibility: Sampling randomness in negative examples leads to variability in CAV direction; the estimation variance decreases as $1/N$, but explainability stability may plateau due to borderline cases (Wenkmann et al., 28 Sep 2025).
- Concept Entanglement and Orthogonality: In the presence of correlated concepts (e.g., “beard” and “necktie” in CelebA), CAVs for each may be non-orthogonal and conflated, reducing isolation and interpretability. Post-hoc orthogonalization with a non-orthogonality loss can disentangle concepts, improving targeted activation steering and feature editing (Erogullari et al., 7 Mar 2025); a simplified projection-based sketch appears after this list.
- Inconsistency Across Layers and Spatial Dependence: The effect of a CAV perturbation is generally not preserved across network layers, due to nonlinearities such as ReLU and sigmoid activation. This implies that concept attributions may exhibit inconsistency and spatial localization that should be made explicit in explanation reports (Nicolson et al., 4 Apr 2024). GCAV (Global CAV) addresses this by unifying per-layer CAVs into a single, globally consistent concept direction using cross-layer alignment, contrastive learning, and attention-based fusion, thereby reducing variance in TCAV scores and improving robustness (He et al., 28 Aug 2025).
- Nonlinear and Regional Concepts: Traditional CAV approaches rely on linear separability; approaches such as Concept Gradients (CG) extend to nonlinear concept boundaries using the chain rule and pseudo-inverse Jacobians (Bai et al., 2022), while Concept Activation Regions (CARs) replace vectors with regions (via kernel SVMs) to address multimodality and non-convexity (Crabbé et al., 2022).
- Language-Guided and Automated CAV Construction: LG-CAV leverages vision-language models (e.g., CLIP) to train CAVs without requiring large labeled datasets, by aligning textual concept activations with image activations over a common probe set. This enables scalable, label-free concept probing and model correction via activation sample reweighting (Huang et al., 14 Oct 2024). Automated concept description by mapping representative images of a CAV to joint text-image embeddings further bridges activation vectors with human language (Schmalwasser et al., 23 Oct 2024).
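As a simplified illustration of post-hoc disentanglement, the sketch below projects one CAV onto the orthogonal complement of another and compares cosine similarities before and after. This is a lightweight stand-in for the non-orthogonality-loss approach cited above, and the concept names are illustrative.

```python
import numpy as np

def orthogonalize(v: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Remove from v its component along the (unit-norm) reference direction."""
    reference = reference / np.linalg.norm(reference)
    v_orth = v - (v @ reference) * reference
    return v_orth / np.linalg.norm(v_orth)

# Two synthetic CAVs sharing an entangled component (e.g., "beard" and "necktie").
rng = np.random.default_rng(4)
shared = rng.normal(size=128)
cav_beard = rng.normal(size=128) + 2.0 * shared
cav_necktie = rng.normal(size=128) + 2.0 * shared
cav_beard /= np.linalg.norm(cav_beard)
cav_necktie /= np.linalg.norm(cav_necktie)

print("cosine similarity before:", float(cav_beard @ cav_necktie))       # noticeably nonzero
cav_beard_orth = orthogonalize(cav_beard, cav_necktie)
print("cosine similarity after :", float(cav_beard_orth @ cav_necktie))  # ~0 by construction
```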
| Limitation/Advancement | Key Solution or Implication | Reference(s) |
|---|---|---|
| Negative set sensitivity | Multi-run averaging, adversarial analysis, robust sampling | (Schnoor et al., 26 Sep 2025, Wenkmann et al., 28 Sep 2025) |
| Inter-concept entanglement | Orthogonalization via non-orthogonality loss, targeted disentanglement | (Erogullari et al., 7 Mar 2025) |
| Layer-wise inconsistency | Cross-layer alignment: GCAV, variance minimization, attention-based fusion | (He et al., 28 Aug 2025, Nicolson et al., 4 Apr 2024) |
| Data scarcity for concept vectors | Language guidance (LG-CAV), vision-language alignment | (Huang et al., 14 Oct 2024) |
| Nonlinear/multimodal concepts | Concept Gradient, CAR, kernel SVM regions | (Bai et al., 2022, Crabbé et al., 2022) |
5. Methodological Extensions and Theoretical Perspective
From a statistical perspective, CAV construction can be interpreted in terms of estimating the difference of means (or more generally, class centroids) in the activation space, with the learned direction converging as the number of samples increases. The probabilistic framework provides explicit expressions for the expected value and covariance of the estimated CAV, unifying filter-based, pattern-based, and fast computation approaches (Schnoor et al., 26 Sep 2025). Analytical insights confirm that, under certain conditions (e.g., isotropic class covariance), the mean-difference estimator (as in FastCAV) and SVM-based CAV are theoretically equivalent (Schmalwasser et al., 23 May 2025). This probabilistic view explains empirical robustness and forms the basis for quantifying the reliability and error bounds of explanations (Wenkmann et al., 28 Sep 2025).
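Concretely, under a shared-covariance Gaussian model of the activations (an assumption made here for illustration, in the spirit of the cited probabilistic treatment), the mean-difference estimator has the following first and second moments:

```latex
% Mean-difference CAV under a shared-covariance Gaussian model of activations:
%   a ~ N(mu_C, Sigma) for concept samples, a ~ N(mu_N, Sigma) for negatives.
\[
  \hat{v} = \bar{a}_C - \bar{a}_N, \qquad
  \mathbb{E}[\hat{v}] = \mu_C - \mu_N, \qquad
  \operatorname{Cov}[\hat{v}] = \left(\tfrac{1}{N_C} + \tfrac{1}{N_N}\right)\Sigma .
\]
```

The estimator is thus unbiased for the population mean difference, and its covariance shrinks at the $1/N$ rate noted in Section 2.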
6. Practical Recommendations and Future Directions
For practical deployment:
- Use multiple (statistically validated) runs to quantify and reduce CAV variability.
- Investigate concept entanglement by examining inter-CAV cosine similarities.
- Analyze spatial dependence and layer-wise variability; tools such as consistency error metrics, spatial norm heatmaps, and GCAV may improve reliability.
- For nonlinear or clustered concepts, consider kernel-based (CAR) or chain-rule-based (CG) generalizations.
- Leverage large pre-trained language-image models for label-free CAV estimation.
- For high-dimensional or resource-constrained analysis, use computationally efficient estimators like FastCAV.
- In high-stakes applications, employ post-hoc disentanglement to avoid correlated feature artifacts and ensure that feature editing or steering operations are concept-specific.
Active research continues in robust CAV construction, adversarial defense for explanations, concept discovery and expansion beyond images, and the integration of concept-guided regularization into training. The trade-off between expressivity, efficiency, and statistical validity of explanations remains a central theme.
7. Impact and Significance
CAVs fundamentally redirect model interpretation from low-level “attribution” (such as saliency maps) to more cognitively aligned, concept-based perspectives. They have demonstrated utility in vision, medical imaging, sequence modeling, recommender systems (via subjective and objective soft-attribute CAVs (Göpfert et al., 2022)), and other domains. The approach enables domain experts to diagnose, debug, and refine models in terms of familiar, semantically rich constructs, and also provides avenues for regulatory compliance (such as “right to explanation” legal frameworks). As methodological sophistication and theoretical understanding advance, CAV-based interpretation is poised to remain a central tool for bridging the gap between complex learned representations and human reasoning.