Concept Bottleneck Extensions (CB-SAE)

Updated 6 March 2026

Concept Bottleneck Extensions (CB-SAE) are frameworks that combine human-defined concept bottlenecks with sparsity-driven autoencoders to yield semantically interpretable and controllable deep models.
They utilize a two-stage approach: training a sparse autoencoder with neuron pruning followed by post-hoc concept bottleneck augmentation to align with human-interpretable concepts.
Evaluations based on interpretability, steerability, and geometric cone-alignment metrics report that CB-SAE improves interpretability by $32.1\%$ and steerability by $14.5\%$ over a baseline SAE in vision-language applications (Kulkarni et al., 11 Dec 2025).

Concept Bottleneck Extensions (CB-SAE) are a class of architectures and frameworks that combine, unify, and extend two dominant traditions in interpretable machine learning: Concept Bottleneck Models (CBMs), which use human-defined concept sets to mediate prediction and intervention, and Sparse Autoencoders (SAEs), which discover unsupervised, sparsely activated bottlenecks in foundation model activations. CB-SAE methods aim to achieve high semantic interpretability, controllable model behavior (steerability), and compatibility with large, pretrained multimodal encoders, typically in vision-language or image generation settings.

1. Theoretical Foundations and Geometric Framework

Both CBMs and SAEs can be characterized by learning a set of linear directions ("atoms" or "concepts") in the $d$ -dimensional activation space of a backbone encoder. The set of all nonnegative combinations of these directions forms a convex cone:

$C = \mathrm{cone}(W) = \left\{ W z \mid z \in \mathbb{R}^k_+ \right\},$

where $W \in \mathbb{R}^{d \times k}$ contains $k$ concept-direction vectors.
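As an illustration of the cone definition, the sketch below measures how far a vector lies from $\mathrm{cone}(W)$ by solving a nonnegative least-squares problem with projected gradient descent. The function names, the toy matrix, and the choice of solver are assumptions made for this example and are not taken from the cited papers.

```python
import numpy as np

def nonneg_least_squares(W, v, steps=500):
    """Projected gradient descent for min_z 0.5 * ||W z - v||^2 subject to z >= 0."""
    z = np.zeros(W.shape[1])
    # Step size 1/L, where L is the largest eigenvalue of W^T W.
    lr = 1.0 / (np.linalg.norm(W, 2) ** 2)
    for _ in range(steps):
        grad = W.T @ (W @ z - v)
        z = np.maximum(z - lr * grad, 0.0)  # project onto the nonnegative orthant
    return z

def distance_to_cone(W, v):
    """Euclidean distance from v to cone(W) = {W z : z >= 0}."""
    z = nonneg_least_squares(W, v)
    return float(np.linalg.norm(W @ z - v))

# Toy example with d = 3 and k = 2 concept directions (columns of W).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
inside = W @ np.array([2.0, 0.5])      # a nonnegative combination, so it lies in the cone
outside = np.array([-1.0, 1.0, 0.0])   # needs a negative coefficient on the first atom

d_in = distance_to_cone(W, inside)
d_out = distance_to_cone(W, outside)
```

A distance of zero means the vector is a nonnegative combination of the columns of $W$; a positive distance means part of the vector falls outside the cone.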

CBMs define $W$ through human annotation and supervised concept prediction, ensuring semantic plausibility by construction. SAEs learn $W$ (often with a dictionary of $m \gg k$ atoms) purely by enforcing sparse coding and accurate reconstruction, discovering emergent axes that may or may not correspond to human-understandable concepts. Both frameworks select a low-dimensional concept cone and differ in how the cone is chosen: supervised regularization versus sparsity-driven discovery (Rocchi--Henry et al., 8 Dec 2025).

2. Architectural Formulation of CB-SAE

The core CB-SAE architecture (Kulkarni et al., 11 Dec 2025) comprises:

Sparse Autoencoder Backbone: Given frozen encoder activations $v = f_\theta(x) \in \mathbb{R}^d$, a linear encoder $E_{\rm sae} \in \mathbb{R}^{\omega \times d}$, bias $b \in \mathbb{R}^d$, and sparsifier $\sigma_{\rm sae}$, the sparse code is:

$z = \sigma_{\rm sae}(E_{\rm sae}(v - b)) \in \mathbb{R}^\omega,$

with reconstruction via $D_{\rm sae}$ :

$\hat v = D_{\rm sae} z + b \approx v.$
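The encode and decode steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the sparsifier is a per-sample top-$k$ that clamps negatives to zero (a simplification chosen for the example), the weights are random rather than trained, and the names `top_k` and `sae_forward` are hypothetical.

```python
import numpy as np

def top_k(a, k):
    """Keep the k largest entries of a (clamped at zero) and zero out the rest."""
    out = np.zeros_like(a)
    idx = np.argpartition(a, -k)[-k:]
    out[idx] = np.maximum(a[idx], 0.0)
    return out

def sae_forward(v, E_sae, D_sae, b, k):
    """z = sigma_sae(E_sae (v - b)),  v_hat = D_sae z + b."""
    z = top_k(E_sae @ (v - b), k)
    v_hat = D_sae @ z + b
    return z, v_hat

# Toy dimensions; random weights stand in for a trained SAE.
rng = np.random.default_rng(0)
d, omega, k = 8, 32, 4
E_sae = rng.normal(size=(omega, d))
D_sae = rng.normal(size=(d, omega))
b = rng.normal(size=d)
v = rng.normal(size=d)

z, v_hat = sae_forward(v, E_sae, D_sae, b, k)
```

With trained weights, `v_hat` would approximate `v`; with the random weights used here, only the shapes and the sparsity pattern are meaningful.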

Neuron Pruning: Compute interpretability $I_j$ (via CLIP-Dissect similarity, $I_j = \max_{k \in \mathcal{C}} \mathrm{sim}(q_j, p_k)$) and steerability $S_j$ (via intervention-induced changes, $S_j = \cos(\mathrm{emb}(\tilde o_j), \mathrm{emb}(c_j))$) for each neuron $j$. Prune the $M$ neurons with the lowest $I_j + S_j$, that is, choose the threshold $\tau$ so that exactly $M$ neurons fall below it [Table 2, (Kulkarni et al., 11 Dec 2025)]:

$\mathcal{P} = \{\,j \mid I_j + S_j < \tau \,\}, \quad |\mathcal{P}| = M.$
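The pruning rule can be sketched as a sort on the combined score. The scores below are made-up toy values, and `prune_lowest` is a hypothetical helper; it also returns a threshold $\tau$ for which the set $\{j \mid I_j + S_j < \tau\}$ has exactly $M$ elements, assuming there are no ties at the boundary.

```python
import numpy as np

def prune_lowest(I, S, M):
    """Return (pruned indices, kept indices, tau) for the M lowest I + S scores."""
    score = I + S
    order = np.argsort(score, kind="stable")  # ascending by combined score
    pruned = np.sort(order[:M])
    kept = np.sort(order[M:])
    # tau is the smallest retained score, so every pruned neuron has score < tau.
    tau = score[order[M]] if M < len(score) else np.inf
    return pruned, kept, tau

# Toy per-neuron scores (illustrative values only).
I = np.array([0.30, 0.05, 0.25, 0.10])
S = np.array([0.28, 0.12, 0.20, 0.30])
score = I + S

pruned, kept, tau = prune_lowest(I, S, M=2)
```

Here neurons 1 and 3 have the two lowest combined scores and are pruned, while neurons 0 and 2 are retained.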

Post-hoc Concept Bottleneck Augmentation: Freeze the pruned SAE, and add a linear concept bottleneck module:

$c = E_{\rm cb} (v - b) \in \mathbb{R}^{|\mathcal{C}|}, \ \hat v' = D'_{\rm sae} z' + b + D_{\rm cb} \sigma_{\rm cb}(c),$

where $\sigma_{\rm cb}$ (e.g., top- $k$ sparsifier) yields concept activations for user-provided $\mathcal{C}$ not covered by the retained SAE units.
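A minimal sketch of the augmented forward pass follows, assuming that $z'$ and $D'_{\rm sae}$ denote the retained SAE units and their decoder columns, and that both sparsifiers are a per-sample top-$k$ with negatives clamped to zero. The weights are random and all function and variable names are hypothetical.

```python
import numpy as np

def top_k(a, k):
    """Keep the k largest entries of a (clamped at zero) and zero out the rest."""
    out = np.zeros_like(a)
    idx = np.argpartition(a, -k)[-k:]
    out[idx] = np.maximum(a[idx], 0.0)
    return out

def cb_sae_forward(v, E_sae, D_sae, E_cb, D_cb, b, kept, k_sae, k_cb):
    """v_hat' = D'_sae z' + b + D_cb sigma_cb(c), with c = E_cb (v - b)."""
    u = v - b
    z_kept = top_k(E_sae[kept] @ u, k_sae)   # z': codes of the retained SAE units
    c = E_cb @ u                             # concept logits, one per user concept
    c_sparse = top_k(c, k_cb)                # sigma_cb(c)
    v_hat = D_sae[:, kept] @ z_kept + b + D_cb @ c_sparse
    return z_kept, c_sparse, v_hat

# Toy dimensions; random weights stand in for trained modules.
rng = np.random.default_rng(0)
d, omega, n_concepts = 8, 32, 6
kept = np.arange(0, omega, 2)                # pretend every second unit survived pruning
E_sae, D_sae = rng.normal(size=(omega, d)), rng.normal(size=(d, omega))
E_cb, D_cb = rng.normal(size=(n_concepts, d)), rng.normal(size=(d, n_concepts))
b, v = rng.normal(size=d), rng.normal(size=d)

z_kept, c_sparse, v_hat = cb_sae_forward(v, E_sae, D_sae, E_cb, D_cb, b, kept, k_sae=4, k_cb=2)
```

Only `E_cb` and `D_cb` would be trained in the second stage; the SAE weights stay frozen.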

The CB branch is trained (with SAE frozen) using a composite loss:

$\mathcal{L}_{\rm CB\text{-}SAE} = \mathcal{L}_{\rm recon} + \lambda_1 \mathcal{L}_{\rm sparsity} + \lambda_2 \mathcal{L}_{\rm concept} + \lambda_3 \mathcal{L}_{\rm steer},$

where $\mathcal{L}_{\rm recon}$ is reconstruction error, $\mathcal{L}_{\rm sparsity}$ enforces bottleneck activation sparsity, $\mathcal{L}_{\rm concept}$ aligns concept logits to pseudo-labels (zero-shot CLIP classification), and $\mathcal{L}_{\rm steer}$ enforces causal consistency under concept code interventions (Kulkarni et al., 11 Dec 2025).
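The way the four terms combine can be sketched as below. The specific forms are assumptions made for illustration: mean squared error for reconstruction, an $\ell_1$ penalty for sparsity, and cross-entropy against a pseudo-label index for concept alignment. The steering term depends on a downstream model, so it is passed in as a precomputed scalar rather than implemented.

```python
import numpy as np

def cross_entropy(logits, target):
    """Numerically stable cross-entropy of softmax(logits) against a class index."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])

def cb_sae_loss(v, v_hat, c_sparse, c_logits, pseudo_label, steer_term,
                lam1, lam2, lam3):
    """L = L_recon + lam1 * L_sparsity + lam2 * L_concept + lam3 * L_steer."""
    recon = float(np.mean((v - v_hat) ** 2))          # assumed form: MSE
    sparsity = float(np.abs(c_sparse).sum())          # assumed form: L1 on sigma_cb(c)
    concept = cross_entropy(c_logits, pseudo_label)   # assumed form: CE vs pseudo-label
    return recon + lam1 * sparsity + lam2 * concept + lam3 * steer_term

# Toy check: perfect reconstruction, empty code, uniform logits over 4 concepts.
v = np.ones(8)
total = cb_sae_loss(v, v.copy(), np.zeros(4), np.zeros(4), pseudo_label=2,
                    steer_term=0.5, lam1=0.1, lam2=1.0, lam3=2.0)
```

In this toy case the reconstruction and sparsity terms vanish, so the total is $\log 4$ from the concept term plus $2.0 \times 0.5$ from the steering term.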

3. Metrics for Interpretability and Steerability

Interpretability and steerability are quantified using metrics explicitly designed for mechanistic probe analysis:

Interpretability Score, $I_j$ : Maximum similarity between a neuron's activation pattern and CLIP embedding of user concepts over a probing set.
Steerability Score, $S_j$ : Cosine similarity between embedding of downstream model output after neuron intervention and embedding of the concept label.
Ablation Results: Discarded SAE neurons exhibit $I \approx 0.084$ , $S \approx 0.15$ ; retained SAE neurons, $I \approx 0.238$ , $S \approx 0.26$ ; post-hoc CB neurons, $I \approx 0.323$ , $S \approx 0.23$ ; and full CB-SAE, $I \approx 0.244$ , $S \approx 0.26$ . CB-SAE improves interpretability by $+32.1\%$ and steerability by $+14.5\%$ over the baseline SAE (Kulkarni et al., 11 Dec 2025).
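The two scores can be sketched as below, assuming cosine similarity as the similarity function $\mathrm{sim}$ (CLIP-Dissect supports other similarity functions, so this is a simplification) and small hand-written vectors in place of real CLIP embeddings. The function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def interpretability(q_j, P):
    """I_j = max_k sim(q_j, p_k); also returns the index of the best-matching concept.

    q_j: (n,) activations of neuron j over a probing set.
    P:   (num_concepts, n) concept profiles over the same probing set.
    """
    sims = [cosine(q_j, p_k) for p_k in P]
    best = int(np.argmax(sims))
    return sims[best], best

def steerability(emb_output, emb_concept):
    """S_j = cos(emb(output after intervening on neuron j), emb(concept label))."""
    return cosine(emb_output, emb_concept)

# Toy probing set of 4 images and 3 candidate concepts.
q = np.array([1.0, 0.0, 1.0, 0.0])
P = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
I_j, c_j = interpretability(q, P)
S_j = steerability(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

The concept index returned alongside $I_j$ plays the role of the label $c_j$ that the steerability score compares against.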

Interpretability and steerability metrics facilitate rigorous pruning and drive architectural improvements, ensuring that only high-utility units persist for downstream mechanistic or causal applications.

4. Disentanglement in Concept Bottleneck Extensions

Concept bottleneck extensions that incorporate residual channels (concept-residual models in the sense of Zabounidis et al., 2023) relax the strict bottleneck constraint of CBMs by introducing an unconstrained residual path $r(x) \in \mathbb{R}^m$ in parallel to the main concept path $g(x)$. However, unconstrained residuals enable information leakage, undermining interpretability. Three regularization strategies are deployed:

Iterative Normalization (ZCA Whitening): Batchwise partial whitening of $[g(x), r(x)]$ enforces decorrelation.
Cross-Correlation Minimization: Explicit MSE penalty on off-diagonal covariance between concept and residual codes.
Mutual-Information Minimization (CLUB): Variational upper bound on $I(C; R)$ enforced via a separate Gaussian density estimator.

Empirical results indicate that mutual-information minimization produces the greatest statistical independence, reliably suppressing leakage under both positive and negative concept interventions, especially at higher residual dimension $m$ (Zabounidis et al., 2023). IterNorm achieves efficient linear independence at moderate $m$ , while cross-correlation is simple but less effective in practice.
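Two of the three regularizers can be illustrated compactly. The sketch below computes a cross-covariance penalty between concept and residual codes, and applies an exact ZCA whitening via eigendecomposition. The exact whitening is a stand-in for IterNorm, which approximates whitening iteratively, and the CLUB estimator is omitted. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def cross_covariance_penalty(G, R):
    """Mean squared entry of the cross-covariance between codes G (n, k) and R (n, m)."""
    Gc = G - G.mean(axis=0)
    Rc = R - R.mean(axis=0)
    cross = Gc.T @ Rc / (len(G) - 1)
    return float(np.mean(cross ** 2))

def zca_whiten(X, eps=1e-5):
    """Exact ZCA whitening: center X, then multiply by cov^(-1/2)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

# Synthetic codes: the first residual dimension leaks the first concept.
rng = np.random.default_rng(0)
G = rng.normal(size=(512, 3))
R_leaky = np.hstack([G[:, :1] + 0.1 * rng.normal(size=(512, 1)),
                     rng.normal(size=(512, 1))])

before = cross_covariance_penalty(G, R_leaky)
Z = zca_whiten(np.hstack([G, R_leaky]))                 # whiten [g(x), r(x)] jointly
after = cross_covariance_penalty(Z[:, :3], Z[:, 3:])
```

Whitening removes linear correlation only; nonlinear dependence between the two paths can remain, which is the gap that mutual-information minimization targets.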

Empirical Trade-offs Table

Method	Baseline Acc. (B)	Pos. Interv. (C⁺)	Neg. Interv. (C⁻)	Residual Interv. (R⁻)
Bottleneck	varies	varies	near random	-
Decorr.	.60	.73	.59	.02
IterNorm	.60	.68	.59	.02
MI	.60	.83	.08	.11

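The MI-regularized model keeps accuracy under negative concept interventions (C⁻) low even as $m$ grows. Predictions therefore continue to depend on the concept path rather than on information leaked into the residual, indicating superior retention of concept-specific predictive power (Zabounidis et al., 2023).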

5. Geometric Alignment and Cone Containment

Quantitative analysis of the "concept cone" learned by an SAE relative to a reference CBM cone enables the assessment of both geometric and semantic alignment:

Sparse Cone Reconstruction: Projects CBM axes into the SAE cone, minimizing the $\ell_2$ residual under nonnegative (sparse) combination.
Cone Coverage ( $\mathrm{Cov}$ ): Fraction of CBM cone explained by the SAE atoms.
Directional and Statistical Alignment ( $\rho_{\rm geom}$ , $\rho_{\rm act}$ ): Captures, respectively, the orientation alignment of atoms and the maximal correlation of activations across samples.
Principal-Angle-Based Containment: Approximates the angular mismatch between concept cones, with containment guaranteed when $\Theta_{\rm max}(C_1, C_2) = 0$ (Rocchi--Henry et al., 8 Dec 2025).
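Two of these quantities can be sketched with standard linear algebra. The block below computes principal angles between the linear spans of two atom matrices (a relaxation, since a cone is a subset of its span) and a directional alignment score defined here as the mean, over CBM directions, of the best cosine with any SAE atom. Both definitions are assumptions for illustration and may differ from those used by Rocchi--Henry et al.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians) between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)   # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis for span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def rho_geom(W_cbm, W_sae):
    """Mean over CBM directions of the largest cosine with any SAE atom."""
    A = W_cbm / np.linalg.norm(W_cbm, axis=0)
    B = W_sae / np.linalg.norm(W_sae, axis=0)
    return float((A.T @ B).max(axis=1).mean())

# Toy atoms in R^3: the two sets share e1, and the second atoms differ by 45 degrees.
W_cbm = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
W_sae = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])

angles = principal_angles(W_cbm, W_sae)
theta_max = float(angles.max())
rho = rho_geom(W_cbm, W_sae)
```

A largest principal angle of zero would mean the two spans coincide; in this toy case it is $\pi/4$, reflecting the tilted second atom.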

Empirical work finds that moderate sparsity ($0.5$–$2\%$ active units) and moderate expansion ($2$–$4\times$ latent size) maximize both cone containment and alignment to CBMs. Excessive expansion diffuses alignment, while excessive sparsity reduces coverage.

6. Practical Configurations, Hyperparameters, and Applications

CB-SAE frameworks are instantiated on frozen, pretrained backbones (e.g., CLIP-ViT, DINOv2), with:

SAE expansion factor $\omega/d = 64$ , $\omega = 65\,536$ , sparsity via Batch Top- $k$ .
Pruning: up to $30\,000$ units retained.
CB module: single linear encoder and decoder, $\sigma_{\rm cb}$ top- $k$ ( $k = 5$ ).
Objective weights $\lambda_1$ – $\lambda_3$ tuned for optimal interpretability and steerability.
Two-stage training: SAE pretraining (or direct dictionary learning), pruning, then CB augmentation, typically using Adam (LR $2 \times 10^{-4}$ ), ImageNet-1K activations at a fixed visual encoder layer.
Downstream applications: language-vision tasks (LLaVA-1.5-7B, LLaVA-MORE), image manipulation (UnCLIP), and mechanistic interpretability studies (Kulkarni et al., 11 Dec 2025).
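The reported settings can be gathered into a small configuration object, sketched below. The class and field names are hypothetical. The backbone width $d$ is derived from the stated $\omega$ and expansion factor, and the loss weights are left unset because the text gives no values for them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class CBSAEConfig:
    omega: int = 65_536               # SAE latent width
    expansion_factor: int = 64        # omega / d
    max_retained_units: int = 30_000  # upper bound on SAE units kept after pruning
    cb_top_k: int = 5                 # k for the top-k sparsifier sigma_cb
    optimizer: str = "adam"
    learning_rate: float = 2e-4
    # lambda_1..lambda_3 are tuned; no values are stated, so none are assumed here.
    lambdas: Optional[Tuple[float, float, float]] = None

    @property
    def d(self) -> int:
        """Backbone activation width implied by omega and the expansion factor."""
        return self.omega // self.expansion_factor

cfg = CBSAEConfig()
backbone_width = cfg.d
```

The derived width is 1024, which follows directly from $65\,536 / 64$.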

CB-SAE is especially effective in guaranteeing coverage of task-relevant human concepts, overcoming the 27–45% concept omission rate typical of unsupervised SAEs on large concept inventories (Kulkarni et al., 11 Dec 2025).

7. Limitations and Directions for Future Work

CB-SAE methods depend on the quality of concept labeling by tools such as CLIP-Dissect, and their coverage and steerability are constrained by the linear capacity of the CB branch. Possible improvements include:

Employing richer (e.g., nonlinear or MLP) CB encoder/decoder modules.
Integrating with advanced unsupervised concept discovery mechanisms (e.g., Transcoders).
Extending to label-efficient and online adaptation scenarios for real-time LVLM deployment.
Refining geometric containment metrics to yield stronger theoretical guarantees on faithfulness (Kulkarni et al., 11 Dec 2025, Rocchi--Henry et al., 8 Dec 2025).

A plausible implication is that CB-SAE, as a unified framework, operationalizes the spectrum between fully supervised, plausible human-aligned concepts (CBMs) and unsupervised mechanistic analysis (SAEs). By offering post-hoc and hybrid strategies for bottleneck control and intervention, CB-SAE extends the accuracy–transparency frontier in interpretable deep models.


References (3)

A Geometric Unification of Concept Learning with Concept Cones (2025)

Interpretable and Steerable Concept Bottleneck Sparse Autoencoders (2025)

Benchmarking and Enhancing Disentanglement in Concept-Residual Models (2023)
