Causal Scrubbing in Neural Networks
- Causal scrubbing is a technique that tests the sufficiency and necessity of specific neural subspaces by replacing activations with donor or null values.
- It employs a hierarchical coarse-to-fine causal tracing method to localize compact, causally-relevant subspaces within complex convolutional and recurrent layers.
- Activation patching in brain-to-speech decoding reveals that targeted subspaces can recover high decoding fidelity, exposing directional asymmetries and interpretable model structures.
Causal scrubbing is a mechanistic interpretability technique for quantifying the causal role of internal subspaces in neural networks, particularly in determining whether targeted subsets of neural activations are sufficient and/or necessary for network function. In brain-to-speech decoding models, causal scrubbing is used in conjunction with coarse-to-fine causal tracing and activation patching to isolate, localize, and quantify the distributed causal structure underlying cross-modal transfer—such as between vocalized, mimed, and imagined speech. The approach systematically replaces select activations with donor or null values to dissect and causally attribute decoding performance to specific subspaces within a model’s layers (Maghsoudi et al., 1 Feb 2026).
1. Formal Definition and Mathematical Foundations
Causal scrubbing tests sufficiency and necessity of internal subspaces in neural network representations. The principal operations are defined as follows:
Given
- $x$: input in a target mode (e.g., imagined speech)
- $h_\ell(x)$: activations at layer $\ell$ for input $x$
- $h_\ell^{\mathrm{donor}}$: activations from a donor mode (e.g., vocalized)
- $h_\ell^{\mathrm{null}}$: null activations from a random sample in the same mode as $x$
- $m \in \{0,1\}^{d}$: binary mask over the $d$-dimensional layer

The scrubbed activation is defined by

$$\tilde{h}_\ell = m \odot h_\ell^{\mathrm{donor}} + (1 - m) \odot h_\ell^{\mathrm{null}},$$

where $\odot$ is element-wise multiplication. The output prediction is then obtained by propagating $\tilde{h}_\ell$ forward through the fixed remainder of the network $f_{>\ell}$:

$$\hat{y} = f_{>\ell}(\tilde{h}_\ell).$$
Sufficiency is quantified by whether retaining activations within a candidate subspace recovers high decoding performance, while necessity tests whether removing that subspace disrupts performance. Performance is measured using metrics such as Pearson correlation coefficient (PCC) and Mel cepstral distortion (MCD).
Framed as an optimization, causal scrubbing becomes

$$\min_{m} \; \mathcal{L}\big(f_{>\ell}(\tilde{h}_\ell),\, y\big) \quad \text{s.t.} \quad \|m\|_0 \le k,$$

with $\mathcal{L}$ as the reconstruction loss and $k$ bounding the subspace size. In practice, $m$ is selected from candidate subspaces rather than optimized directly (Maghsoudi et al., 1 Feb 2026).
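The masked scrub operation above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation; the function name `scrub`, the 8-dimensional layer, and the candidate subspace indices are all invented for the example:

```python
import numpy as np

def scrub(h_donor, h_null, mask):
    """Keep donor activations where mask == 1; replace the rest with null values."""
    mask = mask.astype(h_donor.dtype)
    return mask * h_donor + (1.0 - mask) * h_null

# Toy example: an 8-dimensional layer whose candidate subspace is units 2..5.
rng = np.random.default_rng(0)
h_donor = rng.normal(size=8)   # activations from the donor mode (e.g., vocalized)
h_null = rng.normal(size=8)    # null activations from a random same-mode sample
mask = np.zeros(8)
mask[2:6] = 1

h_tilde = scrub(h_donor, h_null, mask)
assert np.allclose(h_tilde[2:6], h_donor[2:6])  # kept (donor) subspace
assert np.allclose(h_tilde[:2], h_null[:2])     # scrubbed units
```

The downstream prediction is then obtained by feeding `h_tilde` through the frozen remainder of the network in place of the original activations.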
2. Coarse-to-Fine Causal Tracing for Subspace Localization
The high dimensionality of intermediate layers (e.g., convolutional channels or RNN units times frames) necessitates a hierarchical search over potential causal subspaces.
Coarse step: Channels or time-steps are grouped (e.g., groups of 16 channels in a convolutional layer; early/mid/late partitions in an RNN).
- Each group $g$ is kept in a mask $m_g$, while all other subspaces are scrubbed.
- Patch activations from the donor mode and measure $\Delta\mathrm{PCC} = \mathrm{PCC}(\hat{y}_g) - \mathrm{PCC}(\hat{y})$, where $\hat{y}$ is the unpatched output.
- The group maximizing $\Delta\mathrm{PCC}$ is selected for further subdivision.
Fine step:
- The selected group is split into smaller blocks (channels or time-windows).
- For RNNs, a sliding window extracts temporal segments with peak $\Delta\mathrm{PCC}$.
- Resulting subspaces (e.g., conv channels 32–47 or recurrent frames 21–84) define focused interventions.
Three principal interventions arise:
- KEEP-Conv: keep only the targeted conv channel subspace.
- KEEP-RNN: keep only the targeted window in RNN frames.
- KEEP-Combo: keep both causal subspaces.
Parallel random controls (RAND-Conv, RAND-RNN) select blocks of matching size at random.
This hierarchical search enables precise localization of compact, causally-relevant subspaces with high explanatory power for cross-mode decoding (Maghsoudi et al., 1 Feb 2026).
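A minimal sketch of the hierarchical search, assuming a caller-supplied `score_fn` that returns decoding PCC when only the given channel indices are kept and all others are scrubbed (both the function name and the toy scorer are hypothetical):

```python
import numpy as np

def coarse_to_fine(n_channels, score_fn, coarse=16, fine=4):
    """Score coarse channel groups, then subdivide the best-scoring group.

    score_fn(idx) is assumed to return decoding PCC when only channels `idx`
    are kept (patched from the donor) and all other channels are scrubbed."""
    groups = [np.arange(s, min(s + coarse, n_channels))
              for s in range(0, n_channels, coarse)]
    best = max(groups, key=score_fn)                        # coarse step
    blocks = [best[s:s + fine] for s in range(0, len(best), fine)]
    return max(blocks, key=score_fn)                        # fine step

# Hypothetical scorer: pretend channels 32-47 carry the causal signal.
toy = lambda idx: float(np.intersect1d(idx, np.arange(32, 48)).size) / idx.size
block = coarse_to_fine(64, toy)
assert set(block.tolist()) <= set(range(32, 48))  # lands inside the causal group
```

In practice each call to `score_fn` is one patched forward pass over the evaluation set, so the hierarchy keeps the number of passes linear in the number of groups rather than exponential in the subspace size.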
3. Activation Patching Mechanism
Activation patching implements causal scrubbing by constructing modified activations:
- For each unit or channel $i$,

$$\tilde{h}_i = m_i\, h_i^{\mathrm{donor}} + (1 - m_i)\, h_i^{\mathrm{null}}.$$

- The mask $m$ is determined by causal tracing (e.g., top-$k$ channels or temporal windows).
- After patching, the downstream network produces a modified prediction $\hat{y}_{\mathrm{patch}} = f_{>\ell}(\tilde{h}_\ell)$.

Performance impact is measured by:

$$\Delta\mathrm{PCC} = \mathrm{PCC}(\hat{y}_{\mathrm{patch}}) - \mathrm{PCC}(\hat{y}), \qquad \Delta\mathrm{MCD} = \mathrm{MCD}(\hat{y}_{\mathrm{patch}}) - \mathrm{MCD}(\hat{y}).$$
These deltas quantify how much of the original decoding performance is attributable to the patched subspace (sufficiency), or how much is lost by its removal (necessity). Neuron-level patching, in particular, enables detailed saturation analyses and reveals dependencies at fine spatial resolutions (Maghsoudi et al., 1 Feb 2026).
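Putting the pieces together, sufficiency scoring can be sketched as follows. The linear `downstream` map, the helper names, and the toy dimensions are assumptions for illustration, not the paper's model:

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two 1-D signals."""
    a, b = a - a.mean(), b - b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def patch_and_score(h, h_donor, h_null, mask, downstream, target):
    """Apply the patch, run the frozen downstream network, report the PCC delta."""
    h_tilde = mask * h_donor + (1 - mask) * h_null
    return pcc(downstream(h_tilde), target) - pcc(downstream(h), target)

# Toy setup: a random linear map stands in for the frozen remainder f_{>l}.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 32))
downstream = lambda h: W @ h
h, h_donor, h_null = (rng.normal(size=32) for _ in range(3))
target = downstream(h_donor)     # ground truth best explained by donor states

delta = patch_and_score(h, h_donor, h_null, np.ones(32), downstream, target)
assert delta > 0  # a full donor patch recovers performance (sufficiency)
```

Restricting the mask to a candidate subspace and comparing against a size-matched random mask gives the KEEP vs. RAND contrast described above.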
4. Empirical Evaluation in Brain-to-Speech Decoding
Empirical findings in cross-modal brain-to-speech decoding establish the explanatory power of causal scrubbing:
- Sufficiency (Imagined ← Vocalized):
  - Full-patch (all activations replaced) yields PCC = 0.954 (near donor-mode quality).
  - KEEP-Conv (channels 32–47) achieves PCC = 0.666, substantially above RAND-Conv.
  - KEEP-RNN recovers PCC = 0.462, also outperforming RAND-RNN.
- Necessity (Vocalized ← Imagined):
  - Full-patch collapses PCC to 0.177 (from a baseline of 0.752), indicating critical dependence on correct donor-mode states.
  - KEEP subspaces offer partial but not full protection from collapse.
| Direction | Full-Patch (PCC) | KEEP-Conv (PCC) | RAND-Conv (PCC) | KEEP-RNN (PCC) | RAND-RNN (PCC) |
|---|---|---|---|---|---|
| Imagined ← Vocalized | 0.954 | 0.666 | 0.564 | 0.462 | 0.398 |
| Vocalized ← Imagined | 0.177 | 0.575 | 0.575 | 0.456 | 0.460 |
Neuron-level saturation analysis demonstrates that patching a compact set of conv neurons, or roughly 25 RNN neurons, yields maximal $\Delta\mathrm{PCC}$, with marginal gains declining beyond this range, indicating a compact, but not singleton, causal subspace. There are noted asymmetries: transfer from vocalized to imagined or mimed speech recovers much of the donor performance, while patching into vocalized yields almost no gain (Maghsoudi et al., 1 Feb 2026).
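The saturation analysis can be mimicked with a toy sweep over patch sizes. The scorer here is a synthetic stand-in that plateaus at 25 units, chosen only to mirror the qualitative shape reported above:

```python
import numpy as np

def saturation_curve(score_fn, max_units):
    """Score patches of growing size; saturation is where gains stop."""
    return [score_fn(k) for k in range(1, max_units + 1)]

# Synthetic scorer that plateaus once 25 units are patched.
toy = lambda k: min(k, 25) / 25.0
curve = saturation_curve(toy, 40)
saturation = next(k for k, v in enumerate(curve, start=1) if v >= max(curve))
print(saturation)  # -> 25: smallest patch size reaching the plateau
```

In the real analysis, `score_fn` would run one patched forward pass per patch size, ranking units by their traced causal contribution before adding them.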
5. Structure and Manifold Nature of Causal Subspaces
Causal scrubbing reveals that cross-modal speech representations are not organized as discrete mode switches but span a continuous manifold. Tri-modal activation interpolation—linearly combining donor and target activations—produces smooth, monotonic transitions in PCC and MCD, with mimed speech consistently interpolating between imagined and vocalized endpoints. This finding supports the hypothesis of a shared, continuous internal axis governing speech modes.
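Tri-modal interpolation reduces to a convex blend of donor and target activations; a minimal sketch (names and dimensions are illustrative, not the paper's):

```python
import numpy as np

def interpolate_activations(h_target, h_donor, alphas):
    """Linearly blend target and donor activations along a shared axis."""
    return [(1 - a) * h_target + a * h_donor for a in alphas]

rng = np.random.default_rng(2)
h_imagined = rng.normal(size=16)
h_vocalized = rng.normal(size=16)
path = interpolate_activations(h_imagined, h_vocalized, np.linspace(0, 1, 5))

# Endpoints recover the original modes; decoding each blend through the fixed
# downstream network would trace out the smooth PCC/MCD transitions above.
assert np.allclose(path[0], h_imagined)
assert np.allclose(path[-1], h_vocalized)
```

Monotonic metric changes along `path`, with mimed speech sitting between the endpoints, are what support the continuous-manifold interpretation.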
Compact subspaces sufficient for cross-mode transfer are consistently localized in particular layers and consist of small channel groups (≈16 conv channels) or short recurrent frame windows (≈25% of available frames), rather than isolated single neurons or diffuse activity patterns. Sentence-level analysis shows modest neuron-level “winner” consistency across sentences, with certain RNN neurons showing high prevalence. This suggests localized, partially-generalizable causal neurons underpinning representational structure (Maghsoudi et al., 1 Feb 2026).
6. Directionality and Hierarchical Control
Causal effects uncovered by scrubbing are direction-dependent: transferring vocalized to imagined (or mimed) speech representations recovers high decoding fidelity, but the converse direction does not. This establishes an asymmetry—stronger transfer from high-fidelity modes to low-fidelity modes rather than vice versa.
The method also uncovers hierarchical representational structure. Causal subspaces are layer-specific, with convolutional and recurrent layers each harboring distinct, compact sources of cross-modal decodability. Fine-grained tracing reveals temporally and spatially localized bottlenecks—specifically, clusters rather than singleton units are implicated in functional transfer (Maghsoudi et al., 1 Feb 2026).
7. Implications for Mechanistic Interpretability
Causal scrubbing, combined with coarse-to-fine tracing and activation patching, enables precise quantitative dissection of where and how modality-specific information is internally represented and functionally deployed within brain-to-speech decoders. The approach supports mechanistic interpretability at a granular level—demonstrating that compact and specific neural subspaces, rather than globally distributed patterns, are sufficient and partly necessary for high-quality, directionally-biased cross-modal representation transfer.
A plausible implication is that mechanistically interpretable, local interventions within high-performing brain decoding systems can be systematically identified and characterized, advancing both neuroscientific understanding and model transparency (Maghsoudi et al., 1 Feb 2026).