
Hierarchical Attribution Graph Decomposition

Updated 26 January 2026
  • Hierarchical Attribution Graph Decomposition (HAGD) is a methodology that dissects complex neural representations into interpretable attribution graphs using cross-layer transcoder analysis.
  • It leverages linear factorization, sparsity constraints, and graph pruning to uncover latent pivot spaces, decoding bottlenecks, and language-specific feature activations.
  • The approach enables actionable diagnostics and targeted interventions in both large language models and networked systems by isolating critical causal pathways.

Hierarchical Attribution Graph Decomposition (HAGD) denotes a rigorous methodology for dissecting, characterizing, and quantifying the hierarchical organization of representations and attribution pathways within domains such as LLMs and streaming network architectures. It builds upon cross-layer transcoder (CLT) analysis and attribution graph construction to reveal latent pivot spaces, language-identity gating, and formal cause-effect relationships between features and outputs. HAGD leverages linear factorization, sparsity constraints, and graph-theoretic pruning to expose the critical circuits, feature specializations, and decoding bottlenecks underlying complex system behavior.

1. Conceptual Foundations and Formal Definitions

At the core of HAGD are cross-layer transcoders, defined as pairs of encoder/decoder linear projections that decompose full-layer mappings into sparse, interpretable feature spaces. For a model with layerwise activation $h_\ell \in \mathbb{R}^{d}$, the encoder produces feature activations $z_\ell = \operatorname{ReLU}(W_{\text{enc}}^\ell h_\ell + b_{\text{enc}}^\ell) \in \mathbb{R}^{d_x}$. The decoder reconstructs feed-forward outputs for target layers via $\hat{m}_{\ell'} = \sum_{\ell \le \ell'} W_{\text{dec}}^{\ell \to \ell'} z_\ell + b_{\text{dec}}^{\ell'}$ (Harrasse et al., 13 Nov 2025).
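The encoder/decoder maps above can be sketched numerically. This is a minimal toy, not the paper's implementation: the dimensions are tiny and the random weights stand in for trained CLT parameters.

```python
# Sketch of a cross-layer transcoder (CLT) forward pass.
# All weights below are random stand-ins for trained CLT parameters.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_x, n_layers = 8, 32, 4  # toy dimensions for illustration

# Per-layer encoder: W_enc[l] has shape (d_x, d_model).
W_enc = [rng.normal(0, 0.1, (d_x, d_model)) for _ in range(n_layers)]
b_enc = [np.zeros(d_x) for _ in range(n_layers)]
# Cross-layer decoders: W_dec[l][lp] maps features at layer l to the
# feed-forward output of target layer lp >= l, shape (d_model, d_x).
W_dec = [[rng.normal(0, 0.1, (d_model, d_x)) for _ in range(n_layers)]
         for _ in range(n_layers)]
b_dec = [np.zeros(d_model) for _ in range(n_layers)]

def encode(h, l):
    """z_l = ReLU(W_enc^l h_l + b_enc^l): sparse feature activations."""
    return np.maximum(W_enc[l] @ h + b_enc[l], 0.0)

def decode(z_per_layer, lp):
    """m_hat_lp = sum over l <= lp of W_dec^{l->lp} z_l, plus bias."""
    out = b_dec[lp].copy()
    for l in range(lp + 1):
        out = out + W_dec[l][lp] @ z_per_layer[l]
    return out

h = [rng.normal(size=d_model) for _ in range(n_layers)]
z = [encode(h[l], l) for l in range(n_layers)]
m_hat = decode(z, lp=2)  # reconstruction for target layer 2
```

Note that each target layer's reconstruction reads features from all earlier layers, which is what distinguishes a cross-layer transcoder from a per-layer sparse autoencoder.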

HAGD extends these constructs by assembling directed attribution graphs with nodes corresponding to CLT features (and optionally input embeddings) and edges weighted by attribution scores. The attribution between feature $n$ at $(\ell, k)$ and feature $n'$ at $(\ell', k')$ is computed as

$$a_{\ell, k, n}^{\ell', k', n'} = \sum_{s=\ell}^{\ell'} f_{k,n}^{\ell \to s} \cdot J_{\ell,k}^{\ell',k'} \cdot g_{k',n'}^{\ell'}$$

where $f_{k,n}^{\ell \to s}$ and $g_{k',n'}^{\ell'}$ are decoder/encoder weights, and $J_{\ell,k}^{\ell',k'}$ is the local Jacobian from source to target activation (Harrasse et al., 13 Nov 2025).
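The score above contracts a source feature's decoder directions with a target feature's encoder direction through the frozen local Jacobian. A hedged numerical sketch, where the per-write Jacobians `J[s]` and the random vectors for `f` and `g` are illustrative stand-ins for trained quantities:

```python
# Sketch of an attribution score between two CLT features.
# f, J, and g are random stand-ins; the per-write Jacobians J[s]
# are an assumption made for illustration.
import numpy as np

rng = np.random.default_rng(1)
d_model = 8
l, lp = 0, 2  # source and target layers

# f[s]: decoder direction of the source feature written into layer s.
f = {s: rng.normal(size=d_model) for s in range(l, lp + 1)}
# J[s]: frozen local Jacobian from the layer-s write to the activation
# read by the target encoder at layer lp (d_model x d_model).
J = {s: rng.normal(0, 0.1, (d_model, d_model)) for s in range(l, lp + 1)}
# g: encoder row of the target feature at layer lp.
g = rng.normal(size=d_model)

def attribution(f, J, g, l, lp):
    """a = sum over s of g^T J_s f_s, with non-linearities frozen."""
    return sum(g @ J[s] @ f[s] for s in range(l, lp + 1))

a = attribution(f, J, g, l, lp)
```

Because non-linearities are frozen, the score is linear in each factor, which is what makes the resulting graph edges additive and prunable.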

2. Attributive Decomposition and Graph Construction Pipeline

The HAGD pipeline entails sequential stages that extract mechanistic structure from deep architectures:

  1. Feature Extraction via CLTs: Activations $z_\ell$ are computed for all layers using CLT encoders, typically with sparsity-enforcing JumpReLU variants and dead-feature penalties.
  2. Linearization and Attribution Score Computation: Decoder weights and local Jacobians are composed to produce attribution scores for node pairs, freezing non-linearities for interpretability.
  3. Graph Assembly: The resulting scores define a directed graph. Nodes represent features; edges quantify effect propagation between features across layers.
  4. Pruning and Summarization: Nodes and edges are pruned until 80% of final output logit effect and 95% of edge mass are retained. This isolates the ~20–30 most relevant features per prediction.
  5. Quantitative Feature Analysis: For each feature $f$, the entropy $H(f)$ of its language activation distribution is computed as $H(f) = -\sum_l p_l(f) \log p_l(f)$, where $p_l(f)$ reflects proportional activation in language $l$ (Harrasse et al., 13 Nov 2025).
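Steps 4 and 5 above can be sketched as follows. The edge list, weights, and per-language activation counts are illustrative; the retention thresholds mirror those stated in the pipeline.

```python
# Sketch of graph pruning by edge mass and language-entropy scoring.
# Edge weights and language counts are illustrative stand-ins.
import numpy as np

def prune_edges(edges, mass_frac=0.95):
    """Keep the heaviest edges until mass_frac of total |weight| is kept."""
    edges = sorted(edges, key=lambda e: abs(e[2]), reverse=True)
    total = sum(abs(w) for _, _, w in edges)
    kept, mass = [], 0.0
    for u, v, w in edges:
        if mass >= mass_frac * total:
            break
        kept.append((u, v, w))
        mass += abs(w)
    return kept

def language_entropy(counts):
    """H(f) = -sum_l p_l(f) log p_l(f) over per-language activation mass."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

edges = [("f1", "f2", 5.0), ("f1", "f3", 0.2),
         ("f2", "out", 3.0), ("f3", "out", 0.1)]
kept = prune_edges(edges)
h_specific = language_entropy([100, 1, 1])   # near language-specific
h_shared = language_entropy([50, 50, 50])    # language-agnostic pivot
```

Low entropy flags a feature as language-specific; entropy near $\log$ of the language count flags it as shared, which is the quantity plotted layerwise in the next section.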

3. Hierarchical Organization and Feature Specialization

HAGD empirically reveals a hierarchical “pivot-decoder” structure in multilingual LLMs:

  • Early Layers: Features are highly language-specific (low entropy $H(f)$), encoding initial language identity.
  • Middle Layers (Pivot Space): Feature activations display high entropy; representations become language-agnostic and nearly identical across languages.
  • Late Layers: A small set of high-frequency “language features” re-emerges, responsible for final decoding and gating language-specific outputs (Harrasse et al., 13 Nov 2025).

The layerwise average entropy forms a U-shaped curve, demarcating the shared pivot region and language bottlenecks. Attribution graphs trace embedding nodes through mid-layer pivot features to late-stage language decoders, pinpointing causal paths.

4. Algorithmic Details and Model Architectures

In operational settings, HAGD is realized via:

  • Encoder/Decoder Projections: Empirically, $W_{\text{enc}}^\ell$ (shape $d_x \times d_{\text{model}}$) and $W_{\text{dec}}^{\ell \to \ell'}$ (shape $d_{\text{model}} \times d_x$) are trained with MSE reconstruction loss, $L_0$ sparsity (weight $\lambda_0$), and dead-feature regularization ($\lambda_{df}$). Expansion factors up to 32× the model width are standard; e.g., $d_x = 24{,}576$, $d_{\text{model}} = 768$ for GPT-2 (Harrasse et al., 13 Nov 2025).
  • Training Protocol: CLTs are trained post-hoc on pre-activated transformer states, sampling from balanced multilingual corpora, optimizing via AdamW ($\text{LR} = 2 \times 10^{-4}$, batch size $1024$), and monitored on explained variance and dead-feature count.
  • Interpretation Metrics: Graph-theoretic pruning and entropy calculations provide interpretable quantitative metrics: reconstruction error, relative layer contribution, and language-feature specificity.
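A minimal sketch of the training objective described above, under stated assumptions: an $L_1$ penalty stands in for the $L_0$ sparsity surrogate, the dead-feature term is a simple batch-level fraction, and the hyperparameter names (`lam_sparse`, `lam_df`) are illustrative.

```python
# Hedged sketch of a CLT training loss: MSE reconstruction plus a
# sparsity penalty (L1 surrogate standing in for the L0 term) and a
# dead-feature penalty. Hyperparameter names are illustrative.
import numpy as np

def clt_loss(m_true, m_hat, z, lam_sparse=1e-3, lam_df=1e-4):
    mse = float(((m_true - m_hat) ** 2).mean())
    sparsity = float(np.abs(z).sum(axis=-1).mean())  # L1 surrogate for L0
    # Dead-feature term: fraction of features never active in the batch.
    dead = float((z.max(axis=0) == 0).mean())
    return mse + lam_sparse * sparsity + lam_df * dead

rng = np.random.default_rng(2)
m_true = rng.normal(size=(16, 8))                    # target FF outputs
m_hat = m_true + 0.01 * rng.normal(size=(16, 8))     # imperfect recon
z = np.maximum(rng.normal(size=(16, 32)), 0.0)       # ReLU activations
loss = clt_loss(m_true, m_hat, z)
```

In practice the sparsity and dead-feature terms trade off against reconstruction quality, which is why explained variance and dead-feature count are the monitored quantities during training.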

5. Key Findings and Interpretability Implications

Experimental application of HAGD establishes several salient mechanistic insights in multilingual LLMs:

  • Pivot Representation: Shared circuits in middle layers provide universal semantic processing irrespective of input language, supporting robust generalization.
  • Decoding Bottlenecks: Late-layer high-frequency language features gate output language identity; direct interventions (zeroing or injecting features) can cause models to output in alternative languages or correct semantic errors.
  • Dominant Language Influence: Overtraining in a dominant language suppresses minority-language decoding circuits, producing observable failure modes (e.g., missing semantic flips in Arabic for 90% English runs).
  • Circuit Dissection: Attribution graphs causally isolate the embedding→pivot→language-feature→output pathway, enabling targeted manipulation and diagnosis of multilingual processing failures (Harrasse et al., 13 Nov 2025).
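The interventions described above (zeroing or injecting late-layer language features) can be sketched as an edit to CLT feature activations before decoding. Everything here is hypothetical: the feature indices, the injection value, and the random decoder weights are illustrative, and in a real model the edit would happen inside a forward hook.

```python
# Hedged sketch of a late-layer feature intervention: zero the
# source-language feature and inject the target-language one, then
# re-decode. Indices, values, and weights are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
d_model, d_x = 8, 32
W_dec = rng.normal(0, 0.1, (d_model, d_x))  # stand-in decoder weights

def intervene(z, ablate_idx, inject_idx, inject_val):
    """Zero one language feature and inject another, non-destructively."""
    z = z.copy()
    z[ablate_idx] = 0.0
    z[inject_idx] = inject_val
    return z

z = np.maximum(rng.normal(size=d_x), 0.0)    # original feature activations
z_edit = intervene(z, ablate_idx=3, inject_idx=7, inject_val=4.0)
out_before = W_dec @ z
out_after = W_dec @ z_edit                   # steered decoder output
```

The point of the sketch is the mechanism, not the numbers: because the decoder is linear in the features, a single-feature edit shifts the output along that feature's decoder direction, which is how gating the "language features" steers output language.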

A plausible implication is that optimizing for more balanced pretraining data and promoting feature entropy in late layers may improve cross-lingual consistency and minority language robustness.

6. Application to Networked Systems and Migration

While HAGD is primarily described in LLM mechanistic interpretability, analogous hierarchical attribution principles have been leveraged in network streaming optimization (Farrow et al., 2015). For instance, transcoder placement and migration across cloud switches are structured into hierarchical control/data layers; optimal placement is solved via score-aggregation heuristics and genetic algorithms, yielding interpretable decompositions of network load and traffic attribution. In both contexts, cross-layer graph decomposition enables dynamic adaptability (e.g., seamless migration with minimal QoS interruption) and exposes resource allocation bottlenecks.

7. Future Directions and Extension Possibilities

The frameworks introduced offer broad potential for extension:

  • Decomposition Beyond Language: Attribution graphs via HAGD could dissect hierarchical structure in tasks such as code-switching, multi-modal fusion, or logical reasoning circuits.
  • Streamlined Algorithms: Joint optimization of CLT loss and attribution pruning could yield improved online interpretability and inference-time adaptability.
  • Network-Model Cross-pollination: Transfer of cross-layer graph techniques between neural architectures and streaming infrastructures suggests avenues for real-time diagnostic tools and resource allocation algorithms.

Integration of migration logic, richer quality-of-service (QoS) parameters, and multi-stream/tenant decompositions in network systems exemplifies such future work (Farrow et al., 2015). This suggests the utility of HAGD in framing and solving both representational and infrastructural problems in modern machine learning and communications research.

References (2)
