Concept Activation Regions (CARs) in Deep Models

Updated 6 March 2026

Concept Activation Regions (CARs) are subsets of neural representation space where semantic or functional concepts are localized.
CARs generalize linear Concept Activation Vectors by capturing complex, nonlinear, and multimodal patterns using classifiers like kernel SVMs or factor analyzers.
They enable interpretable AI across domains by providing spatially resolved and statistically validated insights into model and brain representations.

A Concept Activation Region (CAR) is a subset of a representation space (activation space, feature space, or anatomical space) in which the presence of a semantic or functional concept is realized or can be localized. CARs generalize the single concept direction of linear probes (Concept Activation Vectors, CAVs) to arbitrarily shaped or clustered regions, enabling fine-grained, nonlinear, and multimodal characterization of concepts within deep models, neuroscientific systems, or generative frameworks. CARs have emerged as a foundational unit of concept-based model interpretability, offering principled, quantifiable, and geometrically diverse explanations for both machine and biological representations.

1. Mathematical Formalisms and Model-Space Definitions

In modern deep learning, CARs are most often constructed by training a classifier to distinguish the representations of concept-positive examples from those of concept-negative examples in a given latent space. In the generalized formalism, given a neural network $f : X \to \mathbb{R}^d$, one defines the CAR for a concept $C$ as the region

$R_C = \{ z \in \mathbb{R}^d : h_C(z) = +1 \}$

where $h_C(\cdot)$ is a binary classifier, classically a kernel SVM with radial basis function (RBF) kernel $k(z, z') = \exp(-\gamma \|z-z'\|^2)$, though more generally any classifier with a nonlinear boundary is permitted. The trained decision function

$h_C(z) = \text{sign}(\langle w, \phi(z) \rangle + b)$

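parametrizes the support of the CAR in latent space, where $\phi$ is the feature map induced by the kernel $k$ and $w$, $b$ are the learned weight vector and bias (Crabbé et al., 2022, Tětková et al., 2024). This region can be convex (a half-space in the linear CAV case) or highly non-convex and multimodal (for arbitrary kernels or generative models).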
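As a rough illustration of how such a decision function carves out a region, the sketch below evaluates the dual form of an RBF kernel SVM, $h_C(z) = \text{sign}(\sum_i \alpha_i y_i k(z_i, z) + b)$, on a 2-D toy space. The support vectors, coefficients $\alpha_i$, bias, and $\gamma$ are hand-picked assumptions rather than the output of any training run, and the names `rbf_kernel`, `h_C`, and `in_car` are hypothetical.

```python
import math

def rbf_kernel(z, z_prime, gamma=1.0):
    """k(z, z') = exp(-gamma * ||z - z'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, z_prime))
    return math.exp(-gamma * sq_dist)

# Hand-picked (not trained) support vectors: (point z_i, label y_i, coefficient alpha_i).
SUPPORT = [
    ((-2.0, 0.0), +1, 1.0),  # first mode of the concept
    ((2.0, 0.0), +1, 1.0),   # second mode of the concept
    ((0.0, 0.0), -1, 1.0),   # concept-negative examples sit between the modes
]
BIAS = -0.1  # assumed bias; keeps points far from every support vector outside the region

def h_C(z, gamma=1.0):
    """Dual-form decision function: sign(sum_i alpha_i * y_i * k(z_i, z) + b)."""
    score = sum(alpha * y * rbf_kernel(sv, z, gamma) for sv, y, alpha in SUPPORT) + BIAS
    return 1 if score > 0 else -1

def in_car(z):
    """Membership test for R_C = {z : h_C(z) = +1}."""
    return h_C(z) == 1

left, right, midpoint = (-2.0, 0.0), (2.0, 0.0), (0.0, 0.0)
print(in_car(left), in_car(right), in_car(midpoint))  # prints: True True False
```

Both modes lie inside the region while their midpoint does not, so this toy region is non-convex, and no single half-space could reproduce it.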

In LLMs, CARs are instantiated via a Mixture of Factor Analyzers: each region $k$ is defined by a centroid $\mu_k$ and a covariance $\Sigma_k = W_k W_k^\top + \Psi_k$, with low-rank loading matrix $W_k$ and diagonal noise covariance $\Psi_k$; region assignment for a representation $x_i$ is determined by the posterior responsibility $\gamma_{ik} = p(k \mid x_i)$ (Shafran et al., 2 Feb 2026).
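To make the region-assignment step concrete, here is a minimal sketch of posterior responsibilities under a two-component mixture in a 2-D space. It assumes Gaussian components with mixing weights $\pi_k$ (not specified above) and applies Bayes' rule, $p(k \mid x) \propto \pi_k \, \mathcal{N}(x; \mu_k, W_k W_k^\top + \Psi_k)$. All parameter values and function names are illustrative, not taken from the cited work.

```python
import math

def mfa_covariance(W, psi):
    """Sigma = W W^T + diag(psi) in 2-D; W is a list of 2-D loading columns."""
    sigma = [[0.0, 0.0], [0.0, 0.0]]
    for col in W:
        for a in range(2):
            for b in range(2):
                sigma[a][b] += col[a] * col[b]
    sigma[0][0] += psi[0]
    sigma[1][1] += psi[1]
    return sigma

def gaussian_pdf_2d(x, mu, sigma):
    """Density of a 2-D Gaussian, using the closed-form 2x2 inverse and determinant."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (sigma[1][1] * dx * dx
            - (sigma[0][1] + sigma[1][0]) * dx * dy
            + sigma[0][0] * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def responsibilities(x, regions):
    """Posterior p(k | x) for each region k via Bayes' rule."""
    weighted = [
        r["pi"] * gaussian_pdf_2d(x, r["mu"], mfa_covariance(r["W"], r["psi"]))
        for r in regions
    ]
    total = sum(weighted)
    return [w / total for w in weighted]

# Illustrative parameters: each region has one loading direction (rank-1 W_k).
REGIONS = [
    {"pi": 0.5, "mu": (0.0, 0.0), "W": [(1.0, 0.0)], "psi": (0.1, 0.1)},
    {"pi": 0.5, "mu": (5.0, 5.0), "W": [(0.0, 1.0)], "psi": (0.1, 0.1)},
]

gamma = responsibilities((0.5, 0.0), REGIONS)
assigned_region = max(range(len(gamma)), key=lambda k: gamma[k])
print(assigned_region, gamma)
```

A hard assignment takes the region with the largest responsibility; keeping the full vector instead gives the soft assignment.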

In neuroscience and fMRI applications, a CAR is defined as a set of voxels (anatomical sites) whose activation with respect to a given concept surpasses a statistical significance threshold, commonly via family-wise error-corrected $t$-statistics or behavioral correlation across modalities (Awipi, 2012, Bao et al., 4 Mar 2025).
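The voxel-level definition can be sketched as a thresholding procedure. The example below assumes a one-sample $t$-statistic per voxel and a Bonferroni correction as the family-wise error control (one possible choice; the cited works may use other corrections). It approximates the critical value with a normal quantile, which is only reasonable for large samples. The voxel names and response values are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sample_t(samples):
    """t-statistic of the mean response against zero."""
    n = len(samples)
    return mean(samples) / (stdev(samples) / math.sqrt(n))

def car_voxels(activations, alpha=0.05):
    """Return ids of voxels whose t-statistic exceeds a Bonferroni-corrected threshold.

    activations: dict mapping voxel id -> list of per-trial responses to the concept.
    """
    m = len(activations)  # number of simultaneous tests
    # Assumption: normal approximation to the one-sided t critical value.
    threshold = NormalDist().inv_cdf(1.0 - alpha / m)
    return sorted(v for v, s in activations.items() if one_sample_t(s) > threshold)

# Hypothetical per-trial responses for three voxels.
toy_activations = {
    "v0": [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9],     # consistently active
    "v1": [0.1, -0.2, 0.0, 0.2, -0.1, 0.0, 0.1, -0.1],  # noise around zero
    "v2": [0.5, -0.4, 0.6, -0.5, 0.3, -0.2, 0.4, -0.3], # noisy, weak mean
}
print(car_voxels(toy_activations))
```

With only eight trials per voxel, a real analysis would use the $t$ distribution (or a permutation-based correction) rather than the normal approximation used here.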

2. Algorithms for CAR Construction and Evaluation

The construction of a CAR generally follows a five-step workflow:

Concept Dataset Formation: Derive positive and negative sets for each concept, often using knowledge graphs (e.g., Wikidata, WordNet) to curate and disambiguate user queries or enable hierarchical concept selection (Tětková et al., 2024). In model-driven workflows, positive/negative sets may be gathered from guidance images, model classes, or self-discovered via factorization.
Representation Extraction: Map all examples to the relevant activation space $z=f(x)$ at a given network layer or ROI.
Classifier Training: Fit an SVM (linear for CAV, kernel for CAR), or a more structured density estimator such as a Mixture of Factor Analyzers, to separate or cluster the concept-positive and concept-negative activations (Crabbé et al., 2022, Shafran et al., 2 Feb 2026).
Region Definition: The CAR is the set of activations for which the classifier (or mixture component) responds positively or with high likelihood.
Validation: Cross-validate held-out accuracy, compute TCAR scores (fraction of a class mapped into the concept region), and assess robustness to negative sampling or dataset drift.
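The workflow above can be sketched end to end on toy data. To stay dependency-free, this example swaps the kernel SVM for a simpler kernel-mean classifier (a point is assigned to the concept when its average RBF similarity to positives exceeds that to negatives). That is a stand-in, not the method of the cited papers. The 2-D "activations" are synthetic, validation uses a single hold-out split rather than full cross-validation, and all names are hypothetical.

```python
import math
import random

def rbf(z, zp, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(z, zp)))

def fit_region(pos, neg, gamma=1.0):
    """Return h(z) in {+1, -1}; kernel-mean stand-in for a kernel SVM (assumption)."""
    def h(z):
        score_pos = sum(rbf(p, z, gamma) for p in pos) / len(pos)
        score_neg = sum(rbf(n, z, gamma) for n in neg) / len(neg)
        return 1 if score_pos > score_neg else -1
    return h

def accuracy(h, pos, neg):
    correct = sum(h(z) == 1 for z in pos) + sum(h(z) == -1 for z in neg)
    return correct / (len(pos) + len(neg))

def tcar(h, class_activations):
    """Fraction of a class's activations that fall inside the concept region."""
    return sum(h(z) == 1 for z in class_activations) / len(class_activations)

rng = random.Random(0)

def cluster(center, n, std=0.3):
    return [(rng.gauss(center[0], std), rng.gauss(center[1], std)) for _ in range(n)]

# Steps 1-2 (assumed done): synthetic activations for a two-mode concept.
pos = cluster((-3.0, 0.0), 30) + cluster((3.0, 0.0), 30)
neg = cluster((0.0, 0.0), 60)
rng.shuffle(pos)
rng.shuffle(neg)
train_pos, test_pos = pos[:40], pos[40:]
train_neg, test_neg = neg[:40], neg[40:]

# Steps 3-4: fit the classifier; the region is {z : h(z) = +1}.
h = fit_region(train_pos, train_neg)

# Step 5: held-out accuracy and TCAR scores for two hypothetical classes.
held_out_acc = accuracy(h, test_pos, test_neg)
class_a = [(3.0, 0.0), (2.8, 0.2), (-3.1, 0.0), (-2.9, -0.1)]  # all near concept modes
class_b = [(3.0, 0.0), (-3.0, 0.0), (0.0, 0.0), (0.1, 0.1)]    # half near the negatives
print(held_out_acc, tcar(h, class_a), tcar(h, class_b))
```

With these well-separated toy clusters, `class_a` lies entirely inside the region and half of `class_b` does, which is what the TCAR score is meant to summarize.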

Quantitative metrics include concept-classifier accuracy, agreement between CARs and CAVs, cosine-similarity across repeated retrainings, and OOD generalization rates. In neuroscience, spatial overlap with empirical region-of-interest (ROI) masks, F1, and explained variance/prediction accuracy are used (Bao et al., 4 Mar 2025). Fidelity in practical XAI can also be measured by insertion/deletion curves and agreement with human-annotated regions (Fel et al., 2022).

3. Geometry and Interpretation of CARs versus CAVs

CARs generalize the classical CAV formalism by relaxing the linearity assumption. A CAV induces a global half-space $\{z : w_C^\top z + b \geq 0\}$, corresponding to a single interpretable direction. CARs, in contrast, can carve out arbitrarily complex regions defined by support vectors in a kernel SVM, or soft-assign activations to regions via a mixture model, capturing multiple clusters, nonlinearity, and low-rank substructure (Crabbé et al., 2022, Shafran et al., 2 Feb 2026).

This shift has several implications:

Multi-modality: CARs capture concepts distributed over multiple semantic modes, e.g., "striped" (zebra, tiger) or clinical grades manifesting in disjoint feature clusters.
Nonlinear separability: Nonlinear boundaries improve alignment with human- and scientifically annotated concept prevalence and yield more faithful local attributions.
Coverage/Disentanglement: In Mixture of Factor Analyzer CARs, both centroid and local geometry/subspace are available for control and interpretation, allowing for concept steering and local variation analysis (Shafran et al., 2 Feb 2026).

Empirical studies show that CARs consistently exceed CAVs in concept-classifier accuracy and in the correlation between global TCAR scores and ground-truth concept prevalence (e.g., $r \approx 0.9$ for CAR, $r \approx 0.5$ for TCAV (Crabbé et al., 2022)).

4. Spatial and Localized CARs: Fine-Grained Interpretation

Spatial decomposition extends CARs to settings where concept localization within an input is critical. The Spatial Activation Concept Vector (SACV) framework replaces global pooling with per-location scoring: for each spatial location $u$, the activation column $f_\ell(x)[:, u]$ is scored by the dot product $v_c^\top f_\ell(x)[:, u]$, which yields a spatial concept score. After thresholding, the activated locations collectively form the concept’s activation region for that input (Wang et al., 2022).

This approach yields:

Spatial heatmaps: Fine-grained visualizations showing precisely where in the input a concept is present.
Background suppression: By focusing only on high-scoring locations, background clutter no longer pollutes concept attribution (empirically, background activations in negative images yield near-zero scores in SACV vs. substantial false positives in TCAV).
Per-location attributions: These enable heatmap-based assessment of “concept-to-class” influence via backpropagation (Wang et al., 2022).
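The per-location scoring described above can be sketched in a few lines. The example assumes the layer activation is stored as a nested list indexed `[channel][row][col]` and that a concept vector `v_c` with one weight per channel is already available. The values and the threshold are illustrative only.

```python
def spatial_concept_scores(activation, v_c):
    """Score each location u = (i, j) by the dot product v_c . activation[:, i, j]."""
    channels = len(activation)
    height, width = len(activation[0]), len(activation[0][0])
    return [
        [sum(v_c[c] * activation[c][i][j] for c in range(channels)) for j in range(width)]
        for i in range(height)
    ]

def activation_region(scores, threshold):
    """Boolean mask of locations whose concept score exceeds the threshold."""
    return [[s > threshold for s in row] for row in scores]

# Toy activation with 2 channels on a 2x3 spatial grid (illustrative values).
activation = [
    [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]],  # channel 0
    [[0.0, 1.0, 2.0], [0.0, 3.0, 0.0]],  # channel 1
]
v_c = [1.0, -1.0]  # assumed concept vector: channel 0 supports the concept, channel 1 opposes it

scores = spatial_concept_scores(activation, v_c)
mask = activation_region(scores, threshold=0.5)
print(scores)  # prints: [[1.0, -1.0, 0.0], [0.0, -3.0, 1.0]]
print(mask)    # prints: [[True, False, False], [False, False, True]]
```

The mask is the per-input activation region; upsampling it to the input resolution would give the kind of spatial heatmap mentioned above.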

In unsupervised XAI, CRAFT and similar approaches employ factorization (e.g., NMF) and gradient-based attribution to automatically extract and localize a hierarchy of CARs across network layers, quantifying “what” (concept identity) and “where” (input localization) via explicit mathematical treatment (Fel et al., 2022).

5. Empirical, Neuroscientific, and Cross-Domain CARs

Beyond neural network activations, CARs have deep roots in neuroscience. Empirical CARs are defined as those brain regions (e.g., perirhinal cortex, fusiform face area) where concept-selective activity (e.g., repetition suppression) correlates with behavioral or conceptual priming across modalities, robust to modality and task variation (Awipi, 2012). In contemporary fMRI simulation (MindSimulator), CARs are the subset of voxels exhibiting statistically significant (family-wise error-corrected) activation to concept stimuli, localized via conditional generative models and validated by overlap with empirical region masks (Bao et al., 4 Mar 2025).

Such approaches support:

Multimodal and cross-domain concept mapping: CARs can be defined in image, text, LLM, and anatomical spaces (Tětková et al., 2024, Shafran et al., 2 Feb 2026, Bao et al., 4 Mar 2025).
Use of large-scale, individualized, and ecologically valid datasets: Knowledge graph-based pipelines integrate human taxonomies with empirical data to define and test CARs robustly, enabling personalization and alignment with user intent (Tětková et al., 2024).

6. Practical Guidance and Current Limitations

Empirical studies emphasize several best practices:

Use at least 200 positive/negative examples per concept for stability in kernel- and knowledge-graph–derived CARs; accuracy and robustness drop precipitously with smaller datasets (Tětková et al., 2024).
Cross-validate (10-fold or similar) and test out-of-distribution generalization where intended (Tětková et al., 2024).
Early layers in deep networks tend to yield unstable and less meaningful CARs; focus interpretation on intermediate to late layers.
Compare CAR and CAV outputs (agreement >80% in deep layers suggests CAVs may suffice; otherwise, nonlinear region boundaries are necessary) (Tětková et al., 2024).
In the neuroscience setting, conjunction of cross-modal repetition suppression and correlation with behavioral priming is required to establish a region as a CAR (Awipi, 2012).
Limitations include computational scaling of kernel SVMs to large concept datasets, sensitivity to user intent and knowledge graph quality, and, in generative neuroscience, the ecological validity of stimuli and anatomical alignment (Tětková et al., 2024, Bao et al., 4 Mar 2025).
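The CAR/CAV comparison recommended among the practices above amounts to measuring how often the two classifiers give the same label on the same activations. A minimal sketch follows, assuming both sets of predictions are already available as lists of +1/-1 labels. The function names are hypothetical, the labels are made up, and the 0.8 default mirrors the threshold quoted above.

```python
def agreement(car_labels, cav_labels):
    """Fraction of examples on which the CAR and CAV classifiers agree."""
    if len(car_labels) != len(cav_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(a == b for a, b in zip(car_labels, cav_labels))
    return matches / len(car_labels)

def cav_suffices(car_labels, cav_labels, threshold=0.8):
    """True when agreement exceeds the threshold, suggesting a linear CAV may be enough."""
    return agreement(car_labels, cav_labels) > threshold

# Hypothetical predictions on ten held-out activations from a deep layer.
car_pred = [1, 1, -1, -1, 1, -1, 1, 1, -1, -1]
cav_pred = [1, 1, -1, -1, 1, -1, 1, 1, -1, 1]

print(agreement(car_pred, cav_pred))     # prints: 0.9
print(cav_suffices(car_pred, cav_pred))  # prints: True
```

When agreement falls at or below the threshold, the disagreeing examples are a natural place to inspect where the nonlinear boundary departs from the half-space.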

7. Applications and Implications Across Domains

CARs have been successfully used to:

Provide global and local, model-agnostic explanations of DNN decisions in images, text, and biomedical data; improve diagnostic transparency in clinical machine learning (e.g., rediscovering Gleason cancer grades) (Crabbé et al., 2022, Tětková et al., 2024).
Automate concept hierarchy discovery and localization (e.g., via CRAFT) without human annotations, enabling scalable XAI that supports both “what” and “where” explanations (Fel et al., 2022).
Quantify alignment between human semantic hierarchies (indexed by knowledge graphs) and induced model representations (Tětková et al., 2024).
Map, validate, and hypothesize novel concept-selective regions in both simulated and real cortex, expanding the scope of concept neuroscience (Awipi, 2012, Bao et al., 4 Mar 2025).
Discover and control nonlinear, multi-dimensional concepts in LLMs, improving both localization and causal steering of behaviors (Shafran et al., 2 Feb 2026).

In summary, Concept Activation Regions provide a unifying and extensible mathematical construct for representing, localizing, and explaining concepts in both artificial and biological systems, replacing directional linearity with region-based, often nonlinear, geometric and functional explanatory primitives.