
Contextual Decomposition: Method and Applications

Updated 1 February 2026
  • Contextual Decomposition is a method that splits model activations into user-specified (inside) and residual (outside) components for precise attributions.
  • It recursively propagates decomposed elements through layers using Shapley-value linearization and attention partitioning to reveal internal causal dynamics.
  • It has broad applications in LSTMs, Transformers, and quantum algorithms, enhancing interpretability and enabling resource-efficient circuit discovery and bias mitigation.

Contextual decomposition (CD) is a model-agnostic, mathematically rigorous methodology for disentangling complex interactions within high-dimensional systems, particularly in recurrent and transformer-based neural networks, as well as quantum algorithms. CD yields fine-grained attributions that partition prediction signals by source input, subnetwork, or contextual conditions—enabling mechanistic interpretability, quantification of context-specific independence, and resource-efficient computational subspaces. The following sections present formal definitions, algorithmic principles, and domain-specific applications with direct references to key research contributions.

1. Formal Definition and Mathematical Foundations

CD is defined by the exact additive decomposition of model activations into two parts: one attributable to a user-specified input subset (or computational component), and the residual due to all other sources. For neural networks, this is

x = \beta + \gamma

where x is any hidden or output vector, \beta is the "inside" component (arising solely from the designated subset), and \gamma is the "outside" contribution.

In LSTM architectures, each hidden state h_t and cell state c_t can be expanded as

h_t = \beta_t + \gamma_t, \qquad c_t = \beta^c_t + \gamma^c_t

with recursion passing through nonlinearity linearizations implemented via permutation-averaged Shapley values. For continuous-variable systems, contextual decomposition is formalized via context-set specific independence (CSSI), identifying regions in the parent outcome space where Y \perp \vec{X}_{A^c} \mid \vec{X}_A, \mathcal{E} holds, providing canonical partitions of context and variable sets (Hwang et al., 2024, Murdoch et al., 2018, Jumelet et al., 2019).
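The additive invariant above can be checked directly for a single linear layer. This is a minimal sketch, not any paper's implementation: all names, shapes, and the choice of "inside" positions are illustrative.

```python
import numpy as np

# Minimal sketch of the core invariant x = beta + gamma: split the input
# into "inside" (chosen positions) and "outside" parts, propagate both
# through a linear layer, and verify their sum equals the full activation.

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 6)), rng.normal(size=4)
x = rng.normal(size=6)

mask = np.zeros(6)
mask[[0, 1, 2]] = 1.0                    # user-specified "inside" positions
beta0, gamma0 = x * mask, x * (1 - mask)

beta1 = W @ beta0                        # linear rule: (W beta, W gamma + b)
gamma1 = W @ gamma0 + b
assert np.allclose(beta1 + gamma1, W @ x + b)   # exact additivity preserved
```

The bias is assigned to the "outside" part here, matching the linear-layer rule stated in the next section.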

2. Algorithmic Implementation and Generalization

The core algorithm proceeds recursively by propagating (\beta, \gamma) through each network layer or computation block, applying layer-appropriate rules:

  • Linear layers: f(x) = Wx + b yields (W\beta, W\gamma + b).
  • Nonlinearities: decomposed with
    • For ReLU: \beta^o = \frac{1}{2}[\text{ReLU}(\beta) + (\text{ReLU}(\beta+\gamma) - \text{ReLU}(\gamma))], with \gamma^o defined analogously.
    • For gates in LSTMs: Shapley-value linearization of \sigma(\cdot)/\tanh(\cdot) on gate arguments, assigning cross-terms to inside/outside according to source involvement (Murdoch et al., 2018, Jumelet et al., 2019).
  • Attention heads: only value vectors are decomposed; the attention weights are left undecomposed. Outputs are

y_i = \sum_j \alpha_{ij} (\beta_j^v + \gamma_j^v)

for each token position i (Hsu et al., 2024).
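Two of the rules above can be sketched in a few lines, assuming the formulas exactly as stated (the function names are illustrative, not from any library):

```python
import numpy as np

def cd_relu(beta, gamma):
    # Shapley-style ReLU split from the list above:
    # beta^o = 1/2 [ReLU(beta) + (ReLU(beta+gamma) - ReLU(gamma))]
    relu = lambda z: np.maximum(z, 0.0)
    beta_o = 0.5 * (relu(beta) + relu(beta + gamma) - relu(gamma))
    # gamma^o defined analogously, so beta_o + gamma_o = ReLU(beta + gamma)
    gamma_o = 0.5 * (relu(gamma) + relu(beta + gamma) - relu(beta))
    return beta_o, gamma_o

def cd_attention(alpha, beta_v, gamma_v):
    # Attention weights alpha stay undecomposed; only values are split:
    # y_i = sum_j alpha_ij (beta_v_j + gamma_v_j)
    return alpha @ beta_v, alpha @ gamma_v

rng = np.random.default_rng(1)
beta, gamma = rng.normal(size=5), rng.normal(size=5)
beta_o, gamma_o = cd_relu(beta, gamma)
assert np.allclose(beta_o + gamma_o, np.maximum(beta + gamma, 0.0))
```

The symmetric definition of \gamma^o guarantees that the two parts still sum exactly to the undecomposed ReLU output.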

Extended variants such as Generalized Contextual Decomposition (GCD) selectively retain cross-interactions according to linguistic or structural priors, improving retention of syntactic flows over attractors (Jumelet et al., 2019).

For context-specific independence in continuous-variable settings, the partitioning is learned via parametric binary masks and auxiliary indicators Z, using neural networks and relaxed (Gumbel–Softmax) sampling to optimize partitions jointly with conditional densities (Hwang et al., 2024).

3. Applications in Neural Sequence Models

CD was introduced for LSTMs to extract compositional contributions of phrases and words to individual predictions, revealing non-linear interactions such as negations and default-reasoning heuristics. CD outperforms standard attribution techniques by reliably disentangling phrase-level polarity, quantifying the scalar and vector contributions (W_{\text{out}} \cdot \beta_T), and identifying specific interactions via cross-terms in gating and update steps. Empirical analysis shows high phrase-level separation (KS ≈ 0.74), Pearson correlation with gold standards (up to 0.76 on SST), and a unique ability to attribute negation phenomena (Murdoch et al., 2018, Jumelet et al., 2019).
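Turning the final-state decomposition into a scalar phrase attribution is a one-line projection. This sketch assumes a two-class sentiment head; the weights, dimensions, and class layout are illustrative stand-ins:

```python
import numpy as np

# Sketch: convert (beta_T, gamma_T) at the final timestep into a
# phrase-level attribution via W_out . beta_T, as described in the text.

rng = np.random.default_rng(2)
W_out = rng.normal(size=(2, 8))          # 2 sentiment classes, hidden dim 8
beta_T = rng.normal(size=8)              # "inside" part: the chosen phrase
gamma_T = rng.normal(size=8)             # "outside" part: remaining context

phrase_logits = W_out @ beta_T           # phrase's contribution per class
context_logits = W_out @ gamma_T         # everything else
assert np.allclose(phrase_logits + context_logits, W_out @ (beta_T + gamma_T))

# Scalar polarity: positive-class minus negative-class phrase contribution.
polarity = phrase_logits[1] - phrase_logits[0]
```

Because the decomposition is exact, phrase and context contributions to each logit sum to the model's ordinary output, so no renormalization is needed.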

GCD exposes biases by pinpointing the locus of default reasoning: for example, LSTM biases encode singular/masculine categories as defaults, requiring explicit context to override with plural/feminine cues. This opens practical avenues for bias mitigation at architectural points (Jumelet et al., 2019).

4. Automated Circuit Discovery in Transformers

Contextual Decomposition for Transformers (CD-T) generalizes CD to large-scale transformers, supporting recursive circuit extraction at arbitrary abstraction levels. The propagation rules extend layer-wise to ReLU, linear, and attention modules, providing a fully faithful mapping from source (head/input/MLP) to target activations.

CD-T enables efficient mechanistic circuit discovery, dramatically reducing runtime (roughly 2× faster than path-patching baselines) and achieving sparse yet highly faithful circuits (e.g., recovering 46% of the true-class logit with only 0.04% of attention heads) (Hsu et al., 2024). The recursive algorithm collects the top-N contributors at each level via direct-effect scoring, assembling multi-level circuits that trace causal information flow from outputs back to inputs.
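The top-N collection step can be sketched as follows. The direct-effect scores here are random stand-ins; in CD-T they would come from CD propagation through the model:

```python
import numpy as np

# Sketch of one level of recursive circuit discovery: score every
# candidate source (attention head or MLP) by its direct effect on the
# current target activation and keep the N strongest contributors.

rng = np.random.default_rng(3)
n_sources, N = 144, 6                    # e.g. 12 layers x 12 heads
direct_effect = rng.normal(size=n_sources)

top = np.argsort(-np.abs(direct_effect))[:N]   # largest |direct effect|
circuit = set(top.tolist())
assert len(circuit) == N
# Recursion: each selected source becomes the next target, tracing causal
# flow from the output logit back toward the inputs.
```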

Local and global interpretability benchmarks (SST-2, AGNews, UCSF pathology) demonstrate human-aligned explanations and robust trust ranking, outperforming LIME and SHAP and performing comparably to Integrated Gradients.

5. Hamiltonian Contextual Decomposition in Quantum Algorithms

In quantum computing, contextual decomposition partitions Hamiltonians into noncontextual (H_{nc}) and contextual (H_c) subspaces. SpacePulse integrates CD with parameterized quantum pulses, allowing VQE protocols to restrict quantum computation to the minimal contextual subspace while treating H_{nc} classically. The procedure is:

H = \sum_{P \in S} h_P P = H_{nc} + H_c

with the noncontextual term set S_{nc} admitting a consistent assignment of eigenvalues, enabling classical optimization over stabilizers, and H_c requiring quantum computation (Liang et al., 2023).

The pulse-level VQE ansatz replaces fixed gate circuits with parameterized pulses, accessing broader Hilbert-space regions and yielding shorter, less noisy circuits. Pauli grouping further reduces measurement settings via maximal commuting clusters, with practical reductions in shot counts and measurement overhead.
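The Pauli-grouping step can be illustrated with a greedy first-fit clustering. This is a hedged sketch: the term list is invented for illustration, and real implementations use more sophisticated (e.g. graph-coloring) heuristics:

```python
# Partition a Hamiltonian's Pauli terms into clusters of mutually
# commuting strings, so each cluster needs only one measurement setting.

def commutes(p, q):
    # Two Pauli strings commute iff they differ on an even number of
    # positions where both operators are non-identity.
    diff = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return diff % 2 == 0

def group_commuting(paulis):
    groups = []
    for p in paulis:
        for g in groups:
            if all(commutes(p, q) for q in g):  # fits an existing cluster
                g.append(p)
                break
        else:
            groups.append([p])                  # open a new cluster
    return groups

terms = ["ZZII", "XXII", "IIZZ", "ZIZI", "XIXI"]
groups = group_commuting(terms)
assert len(groups) == 2   # two measurement settings instead of five
```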

Empirical results on molecular systems (Be, NH, BeH⁺, F₂) demonstrate order-of-magnitude resource savings (qubits, circuit duration, measurements) relative to traditional gate-based VQE while maintaining chemical accuracy. This suggests that CD-driven Hamiltonian partitioning combined with pulse-layer quantum control is a pivotal advance for scalable quantum algorithms.

6. Discovery of Context-Specific Independence Patterns

Neural Contextual Decomposition (NCD) formalizes CD for discovering fine-grained local independence relationships over continuous variables, generalizing context-specific independence (CSI) to context-set specific independence (CSSI). NCD learns a partition of the joint outcome space, with each region defined by canonical CSSI. The training procedure uses soft binary masks and joint conditional density modeling:

\hat p(y \mid x) = \sum_{z \in \{0,1\}^d} \hat p_\phi(z \mid x)\, \hat p_\theta(y \mid x \odot z, z)

Monte-Carlo log-likelihood maximization with Gumbel–Softmax ensures differentiability and end-to-end optimization (Hwang et al., 2024).
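The relaxed mask sampling at the heart of this objective can be sketched as follows. The logits and temperature are illustrative stand-ins for the learned parameters \phi, and the binary Gumbel-Softmax form here is one standard construction, not necessarily the exact one used by NCD:

```python
import numpy as np

# Draw a soft per-variable binary mask z via the Gumbel-Softmax (concrete)
# relaxation, so the partition indicator stays differentiable end to end.

rng = np.random.default_rng(4)

def relaxed_binary_mask(logits, tau=0.5):
    # Softmax over {on, off} logits perturbed with Gumbel noise; returns
    # soft mask values in (0, 1) that sharpen toward {0, 1} as tau -> 0.
    on = (logits + rng.gumbel(size=logits.shape)) / tau
    off = (-logits + rng.gumbel(size=logits.shape)) / tau
    m = np.maximum(on, off)                      # for numerical stability
    return np.exp(on - m) / (np.exp(on - m) + np.exp(off - m))

logits = np.array([2.0, -2.0, 0.0])              # per-parent mask logits
z = relaxed_binary_mask(logits)
x = np.array([1.5, -0.3, 0.7])
masked_context = x * z                           # the x ⊙ z fed to the density model
assert z.shape == x.shape and np.all((z > 0) & (z < 1))
```

At low temperature the mask concentrates near {0, 1}, recovering hard context selection while keeping gradients available during training.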

Empirical benchmarks on synthetic and physical-dynamics datasets report ROC AUC above 0.95, with context-specific edge recovery far exceeding attention/multimixture baselines (NCD: >90% of edges at 5% FPR; baselines: <70%). Learned partitions match ground-truth regions, providing the first practical method for learning continuous-variable CSI/CD without discretization.

7. Domain Significance and Limitations

CD and its generalizations provide a rigorous, no-retraining, forward-pass-only protocol for fine-grained attribution and context-sensitive interpretability. The methods are largely domain-agnostic, finding application in NLP, mechanistic circuit analysis in transformers, quantum resource optimization, and causal structure learning in continuous data.

Limitations include computational overhead from permutation-averaging in linearizations, scalability constraints with increasing dimension (especially in NCD), and nonconvex optimization challenges for partition discovery. Extensions involve structured context indicators, integration with other attribution methodologies (Layer-wise Relevance Propagation, Integrated Gradients), and domain-specific adaptations for cross-attention and decoder architectures.

In summary, contextual decomposition is a foundational interpretability and independence discovery technique that yields domain-optimal, mathematically exact attribution and mechanistic insight across high-dimensional models and quantum systems (Murdoch et al., 2018, Jumelet et al., 2019, Hsu et al., 2024, Hwang et al., 2024, Liang et al., 2023).
