Concept Activation Vector (CAV)
- CAV is a vector in a neural network's latent space that encodes the presence of a high-level, human-defined concept via positive and negative examples.
- It enables quantitative model interpretability by measuring directional derivatives with TCAV scores, indicating how concept presence influences predictions.
- Extensions like nonlinear CARs and efficient FastCAV address limitations in linearity, negative set selection, and cross-layer consistency for robust concept analysis.
A Concept Activation Vector (CAV) is a vector in the latent space of a neural network that captures the direction associated with a human-defined, high-level concept. CAVs operationalize concept-based model interpretability, allowing researchers to characterize and manipulate neural representations and quantify conceptual sensitivity via directional derivatives. The canonical CAV methodology proceeds by training a linear classifier to distinguish network-layer activations elicited by concept-present and concept-absent inputs. The resulting weight vector—normalized—is the CAV, which points toward increasing presence of the concept in the latent space. CAVs provide a plug-in interface to quantify concept importance for model predictions (TCAV), test statistical significance, and explore concept manipulations in domains such as vision, language, recommendation, structural biology, audio, and generative modeling (Kim et al., 2017).
1. Mathematical Definition and Computation of CAVs
Let $f_\ell : \mathcal{X} \to \mathbb{R}^{d}$ denote the map from input $x$ to its activation at layer $\ell$ of a pretrained neural network. Given a concept $C$ specified by positive examples $P_C = \{x_1^+, \dots, x_{N^+}^+\}$ and negative (non-concept) examples $N_C = \{x_1^-, \dots, x_{N^-}^-\}$, CAV construction follows these steps (Kim et al., 2017, Crabbé et al., 2022, Lucieri et al., 2020, Shamail et al., 26 Nov 2025):
- Activation Extraction: Form two activation matrices, $A^+ \in \mathbb{R}^{N^+ \times d}$ with rows $f_\ell(x_i^+)$ and $A^- \in \mathbb{R}^{N^- \times d}$ with rows $f_\ell(x_j^-)$.
- Linear Probe Training: Fit a linear classifier (e.g., logistic regression or SVM) on $A^+$ (labeled $1$) vs. $A^-$ (labeled $0$), minimizing a regularized loss:

$$\min_{w,\,b} \; \sum_{i} \mathcal{L}\big(y_i,\, \sigma(w^\top a_i + b)\big) + \lambda \lVert w \rVert^2,$$

where $\mathcal{L}$ is the logistic/hinge loss, $\sigma$ is the logistic function, and $\lambda$ controls regularization.
- CAV Extraction: The Concept Activation Vector is the unit-normalized normal vector to the decision boundary:

$$v_C^\ell = \frac{w}{\lVert w \rVert}.$$

Geometrically, $v_C^\ell$ points in the direction in which the concept becomes maximally present in the latent space.
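The probe-and-normalize procedure above can be sketched with a minimal NumPy logistic-regression probe trained by gradient descent. Synthetic Gaussian activations stand in for a real layer, and all names (`train_cav`, `true_dir`) are illustrative:

```python
import numpy as np

def train_cav(acts_pos, acts_neg, lam=1e-2, lr=0.1, steps=2000, seed=0):
    """Fit an L2-regularized logistic probe on concept vs. non-concept
    activations and return the unit-normalized weight vector (the CAV)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([acts_pos, acts_neg])
    y = np.concatenate([np.ones(len(acts_pos)), np.zeros(len(acts_neg))])
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # logistic function
        grad_w = X.T @ (p - y) / len(y) + 2 * lam * w    # logistic loss + L2
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w / np.linalg.norm(w)  # CAV: unit normal to the decision boundary

# Toy example: the concept shifts activations along a known latent direction.
rng = np.random.default_rng(1)
true_dir = np.array([1.0, 0.0, 0.0, 0.0])
acts_pos = rng.normal(size=(200, 4)) + 2.0 * true_dir
acts_neg = rng.normal(size=(200, 4))
cav = train_cav(acts_pos, acts_neg)
print(cav @ true_dir)  # close to 1.0: the CAV recovers the concept direction
```

In practice the probe is fit on activations extracted from the chosen layer of the pretrained network rather than on synthetic data; the normalization step is what makes the resulting vector usable as a direction for directional derivatives.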
2. TCAV: Sensitivity Analysis and Statistical Testing
Testing with CAVs (TCAV) measures how sensitive a neural network’s class prediction is to infinitesimal perturbations along the CAV direction, quantifying conceptual importance (Kim et al., 2017, Lucieri et al., 2020). Given the (pre-softmax) logit $h_{\ell,k}$ for class $k$ and hidden-layer activations $f_\ell(x)$, the directional derivative at input $x$ is

$$S_{C,k,\ell}(x) = \nabla h_{\ell,k}\big(f_\ell(x)\big) \cdot v_C^\ell,$$

where $\nabla h_{\ell,k}(f_\ell(x))$ is the gradient of the logit with respect to the activations. If $S_{C,k,\ell}(x) > 0$, moving along $v_C^\ell$ increases the logit for class $k$ (i.e., the concept supports the class); if $S_{C,k,\ell}(x) < 0$, it suppresses it.
The global TCAV score for concept $C$, class $k$, and layer $\ell$ is the fraction of class-$k$ inputs with positive directional derivative:

$$\mathrm{TCAV}_{C,k,\ell} = \frac{\big|\{x \in X_k : S_{C,k,\ell}(x) > 0\}\big|}{|X_k|},$$

where $X_k$ is a held-out set of class-$k$ inputs.
For significance, the CAV is recomputed $m$ times with different random negative sets to obtain a distribution of scores $\{\mathrm{TCAV}^{(1)}, \dots, \mathrm{TCAV}^{(m)}\}$, and a two-sided $t$-test checks whether the mean TCAV score differs from $0.5$.
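For a differentiable model, the per-input directional derivative is a gradient-dot-product and the global score is a sign count. A minimal sketch, using a toy logit whose gradient varies with the input as a stand-in for a real network (all names are illustrative, and the $t$-statistic is computed by hand rather than via a statistics library):

```python
import numpy as np

def tcav_score(grad_fn, acts_k, cav):
    """Global TCAV score: fraction of class-k activations whose logit
    gradient has a positive component along the CAV."""
    derivs = np.array([grad_fn(a) @ cav for a in acts_k])
    return float(np.mean(derivs > 0))

def t_statistic(scores, mu0=0.5):
    """One-sample t-statistic of repeated TCAV scores against chance (0.5)."""
    scores = np.asarray(scores, dtype=float)
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return (scores.mean() - mu0) / se

# Toy logit h(a) = u @ a + 0.15 * (a @ a), so grad h(a) = u + 0.3 * a
# varies per input (purely illustrative).
u = np.array([0.3, 0.2, 0.0])
grad_fn = lambda a: u + 0.3 * a
cav = np.array([1.0, 0.0, 0.0])

rng = np.random.default_rng(0)
score = tcav_score(grad_fn, rng.normal(size=(100, 3)), cav)
print(score > 0.5)  # True: the concept supports the class

# Resampling (fresh inputs, perturbed CAVs) yields a score distribution for the t-test.
scores = [tcav_score(grad_fn, rng.normal(size=(100, 3)),
                     cav + 0.05 * rng.normal(size=3)) for _ in range(10)]
print(t_statistic(scores) > 2.0)  # True: mean score reliably differs from 0.5
```

In a real analysis the resampled scores come from CAVs retrained on different random negative sets, exactly as described above; the perturbed-CAV loop here is only a stand-in for that retraining.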
3. Extensions, Generalizations, and Limitations
Several directions extend or relax the classical CAV framework.
Core Assumptions and Linearity
CAVs assume that the concept is linearly encoded in activation space—concept and non-concept activations are linearly separable. However, many concepts, especially at deeper layers or for complex attributes, may not admit such a representation, leading to noisy CAVs or non-causal TCAV scores (Crabbé et al., 2022, Bai et al., 2022, Pahde et al., 2022).
Nonlinear and Regional Generalizations
Concept Activation Regions (CARs) generalize CAVs to nonlinear decision boundaries, modeling concepts as regions (e.g., via kernel SVMs with radial kernels) in latent space, invariant to isometries of the activation geometry (Crabbé et al., 2022). The Concept Gradient (CG) formalism further extends concept sensitivity analysis beyond linear CAVs by allowing arbitrary differentiable concept functions $c : \mathbb{R}^d \to \mathbb{R}$, computing causal attributions even when the concept manifold is nonlinear (Bai et al., 2022).
FastCAV and PatternCAV
Efficient methods such as FastCAV and PatternCAV approximate the CAV direction using differences in class means, justified under isotropic Gaussian assumptions (Schmalwasser et al., 23 May 2025, Pahde et al., 2022, Schnoor et al., 26 Sep 2025). FastCAV computes

$$v_C^\ell = \frac{\mu^+ - \mu^-}{\lVert \mu^+ - \mu^- \rVert}, \qquad \mu^\pm = \frac{1}{N^\pm}\sum_{i} f_\ell(x_i^\pm),$$

substantially reducing computational cost while closely matching SVM-based directions in high dimension (Schmalwasser et al., 23 May 2025).
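The difference-of-means approximation is a one-liner. A sketch of the idea (illustrative names; this follows the mean-difference formulation, not the official FastCAV code):

```python
import numpy as np

def fast_cav(acts_pos, acts_neg):
    """Mean-difference CAV: normalized vector from the non-concept
    centroid to the concept centroid."""
    d = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return d / np.linalg.norm(d)

# Under isotropic Gaussian activations, this recovers the probe direction.
rng = np.random.default_rng(0)
true_dir = np.array([0.0, 1.0, 0.0])
pos = rng.normal(size=(500, 3)) + 1.5 * true_dir
neg = rng.normal(size=(500, 3))
v = fast_cav(pos, neg)
print(v @ true_dir)  # close to 1.0 under the isotropic Gaussian assumption
```

The cost is two means and a norm, versus an iterative probe fit; the trade-off is that the isotropy assumption can fail when activation covariances differ strongly between the concept and non-concept sets.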
Layer Consistency and Cross-Layer Fusion
CAVs trained in different layers can be inconsistent due to the nonlinear, hierarchical organization of representations, affecting interpretability and TCAV stability (Nicolson et al., 2024, He et al., 28 Aug 2025). The Global Concept Activation Vector (GCAV) framework fuses per-layer CAVs into a unified embedding via contrastive alignment and attention, producing layer-stable TCAV scores (TGCAV) and improving robustness (He et al., 28 Aug 2025).
4. Practical Workflow, Sampling Variability, and Alignment
Canonical Workflow
The standard application involves:
- Choosing a network and layer,
- Assembling positive and negative concept sets,
- Computing activations,
- Training the linear concept probe,
- Computing directional derivatives and TCAV scores,
- Assessing statistical significance via repeated resampling and $t$-tests (Kim et al., 2017).
Sampling Variability
CAVs are sensitive to the choice of negative (non-concept) examples. The variance of the estimated CAV decreases as $1/N$ with the number $N$ of random negatives, both theoretically and empirically across domains (image, text, tabular); the recommendation is to average over multiple runs to ensure stable TCAV scores (Wenkmann et al., 28 Sep 2025, Schnoor et al., 26 Sep 2025). The CAV itself is a random vector whose mean and covariance depend on the means and covariances of the concept and non-concept activation distributions.
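The $1/N$ scaling is easy to see in simulation: under the mean-difference estimator, the variance of the CAV estimate shrinks linearly with the number of random negatives. A purely synthetic sketch (Gaussian activations, fixed concept centroid; all names illustrative):

```python
import numpy as np

def cav_estimates(n_neg, repeats, rng):
    """Repeatedly draw random negative sets and return the resulting
    (unnormalized) mean-difference CAV estimates."""
    mu_pos = np.array([2.0, 0.0])  # fixed concept centroid
    cavs = []
    for _ in range(repeats):
        neg = rng.normal(size=(n_neg, 2))  # fresh random negative set
        cavs.append(mu_pos - neg.mean(axis=0))
    return np.array(cavs)

rng = np.random.default_rng(0)
var_small = cav_estimates(50, 400, rng).var(axis=0).sum()
var_large = cav_estimates(500, 400, rng).var(axis=0).sum()
print(var_small / var_large)  # ratio near 500/50 = 10, i.e. variance scales as 1/N
```

Only the negative-set sampling varies here; in practice the positive set is resampled too, which adds a second $1/N^+$ term to the variance.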
Probe Alignment and Robustness
Probe accuracy alone does not guarantee correct concept alignment—a linear probe may exploit spurious correlations. Alignment metrics such as "hard accuracy" (worst-group performance after removing concept-background correlations), segmentation score (fraction of positive attribution inside object masks), and augmentation robustness provide a more reliable assessment. Spatial CAVs and translation-invariant probes further enhance alignment (Lysnæs-Larsen et al., 6 Nov 2025).
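Of these alignment checks, the segmentation score is the simplest to compute: the fraction of positive attribution mass that falls inside the object mask. A minimal sketch (array shapes and the function name are illustrative):

```python
import numpy as np

def segmentation_score(attribution, mask):
    """Fraction of positive attribution that lies inside the object mask.
    attribution: 2D saliency map; mask: binary 2D object mask."""
    pos = np.clip(attribution, 0, None)   # keep positive attribution only
    total = pos.sum()
    return float((pos * mask).sum() / total) if total > 0 else 0.0

attr = np.array([[0.0, 0.2],
                 [0.6, 0.2]])
mask = np.array([[0, 0],
                 [1, 1]])   # the object occupies the bottom row
print(segmentation_score(attr, mask))  # near 0.8: most attribution lies on the object
```

A score near 1 indicates the probe attends to the object itself; a low score suggests the probe is keying on background correlations even if its classification accuracy is high.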
5. Limitations, Interpretive Considerations, and Defenses
CAV methodology faces several intrinsic limitations:
- Dependence on Negative Set: Arbitrary or adversarial choices of non-concept samples can drastically alter the CAV direction and invalidate TCAV scores (Schnoor et al., 26 Sep 2025).
- Entanglement: CAVs for correlated concepts are often non-orthogonal, complicating interpretation and steering (e.g., presence of "beard" may align with "necktie"). Post-hoc orthogonalization reduces such entanglement and side effects (Erogullari et al., 7 Mar 2025, Nicolson et al., 2024).
- Layer and Spatial Effects: CAVs are not guaranteed to be consistent across layers; spatial dependencies in convolutional activations can lead to position-specific CAVs (Nicolson et al., 2024).
- Linearity Restriction: Nonlinear or multipartite concepts are not captured, and CAVs can only model the direction that best fits the distributional contrast (Bai et al., 2022, Crabbé et al., 2022).
- Sampling Variance: User-to-user differences arise from the stochastic construction of negative sets, but averaging over runs with large $N$ yields stability (Wenkmann et al., 28 Sep 2025).
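The post-hoc orthogonalization mentioned above for entangled concepts amounts to a Gram–Schmidt projection: remove from one CAV its component along the other, then renormalize. A minimal sketch of that step (illustrative vectors, not the exact procedure of the cited works):

```python
import numpy as np

def orthogonalize(v, u):
    """Remove from CAV v its component along unit CAV u, then renormalize.
    Steering along the result no longer moves activations along u."""
    v_orth = v - (v @ u) * u
    return v_orth / np.linalg.norm(v_orth)

beard = np.array([0.8, 0.6, 0.0])     # entangled pair (illustrative directions)
necktie = np.array([1.0, 0.0, 0.0])
beard_clean = orthogonalize(beard, necktie)
print(beard_clean @ necktie)  # 0.0: the entanglement with "necktie" is removed
```

For more than two correlated concepts, the same projection is applied iteratively (or via a QR decomposition of the stacked CAV matrix) to obtain a mutually orthogonal concept basis.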
To defend against manipulations and enhance interpretive fidelity:
- Use carefully curated, in-distribution, and balanced negative sets.
- Check stability of CAVs and TCAV scores to resampling.
- Employ orthogonalization for correlated concept disentanglement.
- Evaluate probe alignment using segmentation and robust metrics (Lysnæs-Larsen et al., 6 Nov 2025).
6. Applications and Impact Across Domains
CAVs and their derived metrics have found broad impact in:
- Vision: Quantitative analysis of concepts in image classifiers, interpretability for medical diagnostics (e.g., dermatologist-vetted concepts in skin lesion classification), global class-concept explanations, and generative shape editing in parametric 3D CAD (Kim et al., 2017, Druc et al., 2022, Lucieri et al., 2020).
- Language: Steering LLMs by modifying activations along CAVs associated with toxicity, sentiment, or topic concepts; enabling fine-grained control of output with robust performance (Zhang et al., 10 Jan 2025).
- Audio and Protein Biology: Diagnosing demographic bias or motif localization using CAVs in music or protein embedding spaces (Shamail et al., 26 Nov 2025, Gebhardt et al., 29 Sep 2025).
- Recommendation Systems: Discovering and personalizing user-defined and soft attributes for interactive critiquing in collaborative filtering models (Göpfert et al., 2022).
- Networks and Generative Modeling: CAV-guided steering of generation, shortcut removal in classifiers, and explanation by concept manipulation (Erogullari et al., 7 Mar 2025, Pahde et al., 2022).
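The steering applications above reduce, in their simplest form, to adding a scaled CAV to a hidden representation at inference time. A hedged sketch with a stand-in hidden state (real implementations register a forward hook on a specific layer of the model; the "toxicity" direction here is illustrative):

```python
import numpy as np

def steer(hidden, cav, alpha):
    """Shift hidden activations along a unit CAV direction.
    alpha > 0 amplifies the concept; alpha < 0 suppresses it."""
    return hidden + alpha * cav

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # (tokens, hidden_dim) stand-in state
cav = np.zeros(8); cav[0] = 1.0    # unit "toxicity" direction (illustrative)

suppressed = steer(hidden, cav, alpha=-2.0)
# Each token's projection onto the CAV moves by exactly alpha:
print(np.allclose(suppressed @ cav, hidden @ cav - 2.0))  # True
```

The choice of `alpha` trades steering strength against fluency degradation, and orthogonalized CAVs (Section 5) limit side effects on correlated concepts.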
CAV-based interpretation now constitutes a standard methodology in explainable AI, but its validity depends critically on careful attention to probe construction, negative set selection, sampling stability, and rigorous alignment diagnostics. Ongoing research continues to address nonlinearity, concept disentanglement, and cross-layer coherence for more robust, general concept-based explanations.