Hierarchical Attribution Graph Decomposition
- Hierarchical Attribution Graph Decomposition (HAGD) is a methodology that dissects complex neural representations into interpretable attribution graphs using cross-layer transcoder analysis.
- It leverages linear factorization, sparsity constraints, and graph pruning to uncover latent pivot spaces, decoding bottlenecks, and language-specific feature activations.
- The approach enables actionable diagnostics and targeted interventions in both large language models and networked systems by isolating critical causal pathways.
Hierarchical Attribution Graph Decomposition (HAGD) denotes a rigorous methodology for dissecting, characterizing, and quantifying the hierarchical organization of representations and attribution pathways within domains such as LLMs and streaming network architectures. It builds upon cross-layer transcoder (CLT) analysis and attribution graph construction to reveal latent pivot spaces and language-identity gating, and to formalize cause-effect relationships between features and outputs. HAGD leverages linear factorization, sparsity constraints, and graph-theoretic pruning to expose the critical circuits, feature specializations, and decoding bottlenecks underlying complex system behavior.
1. Conceptual Foundations and Formal Definitions
At the core of HAGD are cross-layer transcoders, defined as pairs of encoder/decoder linear projections that decompose full-layer mappings into sparse, interpretable feature spaces. For a model with layerwise activation $h^{\ell}$, the encoder produces feature activations $a^{\ell} = \sigma(W_{\text{enc}}^{\ell} h^{\ell})$, with $\sigma$ a sparsity-inducing nonlinearity such as JumpReLU. The decoder reconstructs feed-forward outputs for target layers $\ell' \geq \ell$ via $\hat{y}^{\ell'} = \sum_{\ell \leq \ell'} W_{\text{dec}}^{\ell \to \ell'} a^{\ell}$ (Harrasse et al., 13 Nov 2025).
HAGD extends these constructs by assembling directed attribution graphs with nodes corresponding to CLT features (and optionally input embeddings) and edges weighted by attribution scores. The attribution between feature $i$ in layer $\ell$ and feature $j$ in layer $\ell'$ is computed as
$$A_{i \to j} = a_i^{\ell} \, \big(w_{\text{enc},j}^{\ell'}\big)^{\top} J^{\ell \to \ell'} \, w_{\text{dec},i}^{\ell},$$
where $w_{\text{dec},i}^{\ell}$ and $w_{\text{enc},j}^{\ell'}$ are the corresponding decoder/encoder weight vectors, and $J^{\ell \to \ell'}$ is the local Jacobian from the source to the target activation (Harrasse et al., 13 Nov 2025).
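Under these definitions, a single attribution edge reduces to the source activation multiplied through the linearized decoder→Jacobian→encoder path. A minimal NumPy sketch (dimensions, random weights, and the identity Jacobian are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4

a_i = 0.8                                # activation of source feature i
w_dec_i = rng.standard_normal(d_model)   # decoder vector of feature i (writes into the stream)
w_enc_j = rng.standard_normal(d_model)   # encoder vector of feature j (reads from the stream)
J = np.eye(d_model)                      # local Jacobian, frozen (identity for illustration)

# Attribution edge weight A_{i->j}: source activation times the linearized path
# through the decoder vector, the frozen Jacobian, and the target encoder vector.
A_ij = a_i * (w_enc_j @ J @ w_dec_i)
print(float(A_ij))
```

With the Jacobian frozen, the edge weight is a plain scalar, which is what makes graph assembly and pruning over all feature pairs tractable.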
2. Attributive Decomposition and Graph Construction Pipeline
The HAGD pipeline entails sequential stages that extract mechanistic structure from deep architectures:
- Feature Extraction via CLTs: Activations are computed for all layers using CLT encoders, typically with sparsity-enforcing JumpReLU variants and dead-feature penalties.
- Linearization and Attribution Score Computation: Decoder weights and local Jacobians are composed to produce attribution scores for node pairs, freezing non-linearities for interpretability.
- Graph Assembly: The resulting scores define a directed graph. Nodes represent features; edges quantify effect propagation between features across layers.
- Pruning and Summarization: Nodes and edges are pruned so that 80% of the final output logit effect and 95% of the edge mass are retained, isolating the ~20–30 most relevant features per prediction.
- Quantitative Feature Analysis: For each feature $f$, the entropy of its language activation distribution is computed as $H_f = -\sum_{l} p_{f,l} \log p_{f,l}$, where $p_{f,l}$ is the proportion of feature $f$'s activation occurring in language $l$ (Harrasse et al., 13 Nov 2025).
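The entropy metric and mass-based pruning stages above can be sketched directly (toy scores and the 0.8 threshold below are illustrative; the paper's pipeline retains 80% of logit effect and 95% of edge mass):

```python
import numpy as np

def language_entropy(p):
    """H_f = -sum_l p_{f,l} log p_{f,l} over a feature's language activation distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # 0 log 0 is taken as 0
    return float(-(p * np.log(p)).sum() + 0.0)

def prune_by_mass(scores, keep_mass=0.95):
    """Return indices of the smallest edge set covering `keep_mass` of total |score| mass."""
    s = np.abs(np.asarray(scores, dtype=float))
    order = np.argsort(s)[::-1]           # strongest edges first
    cum = np.cumsum(s[order]) / s.sum()
    k = int(np.searchsorted(cum, keep_mass)) + 1
    return order[:k]

print(language_entropy([1.0, 0.0, 0.0]))          # language-specific feature: 0.0
print(language_entropy([1/3, 1/3, 1/3]))          # language-agnostic feature: log 3
print(prune_by_mass([5.0, 3.0, 1.0, 0.5, 0.5], keep_mass=0.8))  # keeps edges 0 and 1
```

A feature that fires in only one language has zero entropy; one that fires equally across $L$ languages attains the maximum $\log L$, which is the contrast the hierarchy analysis in the next section exploits.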
3. Hierarchical Organization and Feature Specialization
HAGD empirically reveals a hierarchical “pivot-decoder” structure in multilingual LLMs:
- Early Layers: Features are highly language-specific (low entropy $H_f$), encoding initial language identity.
- Middle Layers (Pivot Space): Feature activations display high entropy; representations become language-agnostic and nearly identical across languages.
- Late Layers: A small set of high-frequency “language features” re-emerges, responsible for final decoding and gating language-specific outputs (Harrasse et al., 13 Nov 2025).
The layerwise average entropy forms an inverted-U curve: low in the language-specific early layers, peaking across the shared pivot region, and falling again at the late-layer language bottleneck. Attribution graphs trace embedding nodes through mid-layer pivot features to late-stage language decoders, pinpointing causal paths.
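This layerwise picture can be checked with a toy calculation: per-layer mean entropies that are low at the ends and high in the middle localize the pivot region (all entropy values below are invented for illustration):

```python
import numpy as np

# Hypothetical per-feature entropies H_f grouped by layer of a 12-layer model.
entropy_by_layer = {
    0:  [0.1, 0.2, 0.1],   # early layers: language-specific features
    6:  [1.0, 1.1, 0.9],   # middle layers: language-agnostic pivot space
    11: [0.2, 0.1, 0.3],   # late layers: language decoding bottleneck
}

mean_h = {layer: float(np.mean(h)) for layer, h in entropy_by_layer.items()}
pivot_layer = max(mean_h, key=mean_h.get)
print(pivot_layer)         # the entropy peak marks the shared pivot region
```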
4. Algorithmic Details and Model Architectures
In operational settings, HAGD is realized via:
- Encoder/Decoder Projections: Empirically, encoder matrices $W_{\text{enc}}^{\ell}$ (shape $d_{\text{feat}} \times d_{\text{model}}$) and decoder matrices $W_{\text{dec}}^{\ell \to \ell'}$ (shape $d_{\text{model}} \times d_{\text{feat}}$) are trained with an MSE reconstruction loss, a sparsity penalty, and dead-feature regularization. Expansion factors of up to 32× the model width are standard; e.g., $d_{\text{model}} = 768$ for GPT-2 gives $d_{\text{feat}} = 24{,}576$ at 32× (Harrasse et al., 13 Nov 2025).
- Training Protocol: CLTs are trained post hoc on pre-activation transformer states, sampling from balanced multilingual corpora, optimizing via AdamW (batch size $1024$), and monitored on explained variance and dead-feature count.
- Interpretation Metrics: Graph-theoretic pruning and entropy calculations provide interpretable quantitative metrics: reconstruction error, relative layer contribution, and language-feature specificity.
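The projections and loss terms above can be sketched compactly; dimensions, the JumpReLU threshold, and the sparsity coefficient below are illustrative placeholders, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_feat = 8, 32                   # toy 4x expansion (the paper uses up to 32x)

W_enc = 0.1 * rng.standard_normal((d_feat, d_model))
W_dec = 0.1 * rng.standard_normal((d_model, d_feat))

def jump_relu(x, theta=0.3):
    # JumpReLU: zero below the threshold theta, identity above it.
    return np.where(x > theta, x, 0.0)

h = rng.standard_normal(d_model)          # a layerwise activation
a = jump_relu(W_enc @ h)                  # sparse feature activations
y_hat = W_dec @ a                         # reconstructed feed-forward output
y_true = rng.standard_normal(d_model)     # stand-in for the true MLP output

# Objective sketch: MSE reconstruction + L1 sparsity (coefficient illustrative);
# a dead-feature regularizer would additionally push never-firing features to activate.
loss = float(np.mean((y_true - y_hat) ** 2) + 1e-3 * np.abs(a).sum())
print(a.shape, float((a > 0).mean()), loss > 0)
```

The threshold makes most feature activations exactly zero, which is what keeps the downstream attribution graphs sparse enough to prune and read.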
5. Key Findings and Interpretability Implications
Experimental application of HAGD establishes several salient mechanistic insights in multilingual LLMs:
- Pivot Representation: Shared circuits in middle layers provide universal semantic processing irrespective of input language, supporting robust generalization.
- Decoding Bottlenecks: Late-layer high-frequency language features gate output language identity; direct interventions (zeroing or injecting features) can cause models to output in alternative languages or correct semantic errors.
- Dominant Language Influence: Overtraining on a dominant language suppresses minority-language decoding circuits, producing observable failure modes (e.g., missing semantic flips in Arabic for runs trained on 90% English data).
- Circuit Dissection: Attribution graphs causally isolate the embedding→pivot→language-feature→output pathway, enabling targeted manipulation and diagnosis of multilingual processing failures (Harrasse et al., 13 Nov 2025).
A plausible implication is that optimizing for more balanced pretraining data and promoting feature entropy in late layers may improve cross-lingual consistency and minority language robustness.
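The gating finding suggests a simple intervention pattern: ablate the active late-layer language feature and inject another before decoding. A toy sketch with made-up feature indices and random decoder weights (nothing here is from the paper's actual circuits):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_feat = 8, 16
W_dec = rng.standard_normal((d_model, d_feat))   # toy late-layer decoder

FRENCH, ARABIC = 3, 7                    # hypothetical language-feature indices
a = np.zeros(d_feat)
a[FRENCH] = 1.0                          # model is set to decode in French

a_steered = a.copy()
a_steered[FRENCH] = 0.0                  # zero the active language feature
a_steered[ARABIC] = 1.0                  # inject the target language feature

y_base = W_dec @ a
y_steered = W_dec @ a_steered            # output direction now follows the Arabic feature
print(np.allclose(y_steered, W_dec[:, ARABIC]))
```

Because the decoder is linear in the feature activations, swapping which language feature is active swaps which decoder direction dominates the output, which is the mechanism behind the reported language-switching interventions.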
6. Application to Networked Systems and Migration
While HAGD is primarily described in LLM mechanistic interpretability, analogous hierarchical attribution principles have been leveraged in network streaming optimization (Farrow et al., 2015). For instance, transcoder placement and migration across cloud switches are structured into hierarchical control/data layers; optimal placement is solved via score-aggregation heuristics and genetic algorithms, yielding interpretable decompositions of network load and traffic attribution. In both contexts, cross-layer graph decomposition enables dynamic adaptability (e.g., seamless migration with minimal QoS interruption) and exposes resource allocation bottlenecks.
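The score-aggregation flavor of transcoder placement can be sketched as a greedy heuristic (switch names, metrics, and weights below are invented for illustration; Farrow et al. additionally use genetic algorithms for the full problem):

```python
# Candidate switches with hypothetical load/latency measurements.
switches = {
    "sw-a": {"free_cpu": 0.7, "latency_ms": 12.0},
    "sw-b": {"free_cpu": 0.4, "latency_ms": 5.0},
    "sw-c": {"free_cpu": 0.9, "latency_ms": 30.0},
}

def aggregate_score(metrics, w_cpu=1.0, w_lat=0.02):
    # Aggregate score: reward spare capacity, penalize latency (weights illustrative).
    return w_cpu * metrics["free_cpu"] - w_lat * metrics["latency_ms"]

# Greedy placement: put the transcoder on the best-scoring switch.
placement = max(switches, key=lambda s: aggregate_score(switches[s]))
print(placement)
```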
7. Future Directions and Extension Possibilities
The frameworks introduced offer broad potential for extension:
- Decomposition Beyond Language: Attribution graphs via HAGD could dissect hierarchical structure in tasks such as code-switching, multi-modal fusion, or logical reasoning circuits.
- Streamlined Algorithms: Joint optimization of CLT loss and attribution pruning could yield improved online interpretability and inference-time adaptability.
- Network-Model Cross-pollination: Transfer of cross-layer graph techniques between neural architectures and streaming infrastructures suggests avenues for real-time diagnostic tools and resource allocation algorithms.
Integration of migration logic, richer quality-of-service (QoS) parameters, and multi-stream/tenant decompositions in network systems exemplifies such future work (Farrow et al., 2015). This suggests the utility of HAGD in framing and solving both representational and infrastructural problems in modern machine learning and communications research.