
Who Does What in Deep Learning? Multidimensional Game-Theoretic Attribution of Function of Neural Units (2506.19732v1)

Published 24 Jun 2025 in cs.LG and cs.AI

Abstract: Neural networks now generate text, images, and speech with billions of parameters, producing a need to know how each neural unit contributes to these high-dimensional outputs. Existing explainable-AI methods, such as SHAP, attribute importance to inputs, but cannot quantify the contributions of neural units across thousands of output pixels, tokens, or logits. Here we close that gap with Multiperturbation Shapley-value Analysis (MSA), a model-agnostic game-theoretic framework. By systematically lesioning combinations of units, MSA yields Shapley Modes, unit-wise contribution maps that share the exact dimensionality of the model's output. We apply MSA across scales, from multi-layer perceptrons to the 56-billion-parameter Mixtral-8x7B and Generative Adversarial Networks (GAN). The approach demonstrates how regularisation concentrates computation in a few hubs, exposes language-specific experts inside the LLM, and reveals an inverted pixel-generation hierarchy in GANs. Together, these results showcase MSA as a powerful approach for interpreting, editing, and compressing deep neural networks.

Summary

  • The paper introduces MSA, a game-theoretic method that determines the causal impact of individual neural units through systematic perturbation.
  • It generalizes Shapley value concepts to both scalar and high-dimensional outputs, revealing specialized and redundant roles in MLPs, LLMs, and GANs.
  • Practical implications include targeted model pruning and improved interpretability, with an open-source package enabling reproducible research.

Multidimensional Game-Theoretic Attribution of Function in Neural Units: An Expert Overview

The paper "Who Does What in Deep Learning? Multidimensional Game-Theoretic Attribution of Function of Neural Units" (2506.19732) presents a comprehensive framework for quantifying the contributions of individual neural units within deep learning models, extending explainability beyond input attribution to internal network components. The authors introduce Multiperturbation Shapley-value Analysis (MSA), a model-agnostic, game-theoretic approach that systematically perturbs combinations of neural units to estimate their causal impact on both scalar and high-dimensional outputs. This work addresses a critical gap in explainable AI (XAI) by enabling attribution at the level of neurons, filters, or experts, and by supporting multidimensional outputs such as images or token sequences.

Methodological Contributions

MSA generalizes the Shapley value framework to neural networks by treating each unit as a "player" in a cooperative game, where the model's output (or a performance metric) is the value function. The marginal contribution of each unit is computed by measuring the change in output when the unit is included versus excluded from various coalitions. For high-dimensional outputs, the authors introduce "Shapley Modes," which provide unit-wise contribution maps matching the output's dimensionality (e.g., pixel-wise attributions for image generators).
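
Concretely, the Shapley value of a unit is its average marginal contribution over all coalitions of the remaining units. In standard notation, consistent with the setup above (where the value function v(S) is the model's output or performance when only the units in coalition S are left intact and all others are lesioned):

$$\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr],$$

where N is the full set of units. For multidimensional outputs, v(S) is vector-valued and the resulting phi_i is a Shapley Mode with the same shape as the model's output.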

Given the combinatorial explosion in possible coalitions, the authors employ Monte Carlo sampling to estimate Shapley values efficiently. The open-source Python package msapy implements these methods, supporting both local (per-output) and global (aggregated) explanations.
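
As a rough illustration of this permutation-sampling idea (a minimal sketch only; the function names and the lesioning mechanism are hypothetical and do not reflect the msapy API):

```python
import random

def estimate_shapley(units, value_fn, n_permutations=1000):
    """Monte Carlo (permutation-sampling) estimate of per-unit Shapley values.

    units    : list of unit identifiers (e.g., neuron, filter, or expert indices)
    value_fn : callable mapping the set of *intact* units to a scalar score
               (or to a NumPy array, in which case the estimate is a Shapley Mode)
    """
    contrib = {u: 0.0 for u in units}
    for _ in range(n_permutations):
        order = random.sample(units, len(units))  # one random permutation
        intact = set()
        prev = value_fn(intact)                   # everything lesioned
        for u in order:
            intact.add(u)                         # restore unit u
            cur = value_fn(intact)
            contrib[u] += cur - prev              # marginal contribution of u
            prev = cur
    return {u: total / n_permutations for u, total in contrib.items()}
```

Each permutation restores units one at a time starting from a fully lesioned model, so every unit's marginal contribution is averaged over many different coalitions.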

Empirical Findings

The framework is validated across three distinct neural architectures:

1. Multi-Layer Perceptrons (MLPs) on MNIST

  • Regularization Effects: MSA reveals that L1 and L2 regularization concentrate computation in a small subset of neurons, forming "hubs" with high functional importance, while unregularized networks distribute computation more evenly. Removing the top contributing neurons in regularized networks leads to catastrophic performance drops, whereas unregularized networks are more robust to such lesions.
  • Weight-Importance Decoupling: The correlation between neuron weight magnitude and functional importance is weak in minimally sufficient, unregularized networks, but increases with overparameterization and regularization. This challenges the common practice of using weight magnitude as a proxy for importance in pruning and compression.
  • Task Complexity and Similarity: The entropy-based "Index of Distributed Computation" quantifies how distributed the computation is for each class (a minimal sketch of such an entropy index follows this list). More complex digits (e.g., '9') require more distributed computation. Neuron contribution correlations between classes reflect visual similarity, providing a quantitative measure of inter-class functional overlap.
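
As a hedged sketch of how an entropy-based dispersion index can be computed from per-neuron contribution scores (the normalisation to the [0, 1] range is an assumption for illustration, not necessarily the paper's exact definition):

```python
import numpy as np

def distributed_computation_index(contributions):
    """Normalised entropy of (absolute) unit contributions for one class.

    Returns a value near 0 when a single unit carries almost all of the
    contribution and 1 when contributions are spread uniformly across units.
    """
    c = np.abs(np.asarray(contributions, dtype=float))
    p = c / c.sum()                               # contribution distribution
    p = p[p > 0]                                  # avoid log(0)
    entropy = -(p * np.log(p)).sum()
    return entropy / np.log(len(contributions))   # normalise by maximum entropy

# Example: a hub-dominated class versus a distributed one
print(distributed_computation_index([0.90, 0.05, 0.03, 0.02]))  # low
print(distributed_computation_index([0.26, 0.25, 0.25, 0.24]))  # close to 1
```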

2. LLMs with Mixture of Experts (MoE)

  • Domain-Specific Expert Activation: In Mixtral 8x7B, MSA uncovers that different layers and experts specialize in distinct domains (arithmetic, language identification, factual retrieval). Early layers are general-purpose, while deeper layers become domain-specific.
  • Redundancy and Criticality: The analysis identifies both redundant and critical experts. Removing the most important expert in a layer can cause a significant accuracy drop, while removing low-contributing experts can sometimes improve performance, indicating suboptimal routing in the MoE architecture (a sketch of this kind of expert lesioning follows this list).
  • Resilience to Lesioning: The model maintains substantial performance even after removal of a large fraction of low-contributing experts, but is highly sensitive to the removal of top contributors.
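
As a hedged illustration of expert lesioning in a generic mixture-of-experts layer (a sketch assuming a Mixtral-style top-k router; the names below are placeholders and not the authors' or Hugging Face implementation):

```python
import torch

def lesion_expert(router_logits, expert_idx):
    """Mask one expert out of a mixture-of-experts router.

    router_logits : (batch, seq_len, n_experts) tensor of gating scores
    expert_idx    : index of the expert to lesion in this layer

    Setting the expert's logit to -inf means top-k routing can never select
    it, which emulates removing that expert from the layer.
    """
    masked = router_logits.clone()
    masked[..., expert_idx] = float("-inf")
    return masked

# Hypothetical usage inside one MoE layer's forward pass
# (`gate` and `hidden_states` are placeholders):
# logits = gate(hidden_states)                       # (B, T, 8) for a Mixtral-style layer
# logits = lesion_expert(logits, expert_idx=3)       # knock out expert 3
# probs = torch.softmax(logits, dim=-1)
# weights, chosen = torch.topk(probs, k=2, dim=-1)   # routing now avoids expert 3
```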

3. Generative Adversarial Networks (GANs) for Image Synthesis

  • Inverted Feature Hierarchy: MSA demonstrates that, in DCGANs, early layers contribute to high-level features (e.g., facial structure), while later layers refine low-level details (e.g., color channels), inverting the typical feature hierarchy observed in CNNs for classification.
  • Functional Clustering and Robustness: Clustering filters by their contribution patterns reveals groups responsible for specific features (e.g., hair color, teeth). Lesioning these clusters leads to targeted, interpretable changes in generated images, with the network often compensating for lost features up to a point. However, removal of clusters responsible for critical features (e.g., eyes) exposes the limits of this robustness (a minimal filter-lesioning sketch follows this list).
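
As a hedged illustration of how filter-level lesioning can be carried out in a PyTorch generator (the layer path and generator attribute names are assumptions for illustration, not the authors' code):

```python
import torch

def lesion_filters(conv_layer, filter_indices):
    """Attach a forward hook that zeroes the selected output channels.

    conv_layer     : a convolutional module inside the generator
    filter_indices : indices of the filters (output channels) to lesion
    """
    def hook(_module, _inputs, output):
        output = output.clone()
        output[:, filter_indices] = 0.0   # silence the chosen filters
        return output                     # returned value replaces the layer output
    return conv_layer.register_forward_hook(hook)

# Hypothetical usage with a DCGAN generator `G` (the layer path is an assumption):
# handle = lesion_filters(G.main[6], filter_indices=[12, 47, 103])
# z = torch.randn(16, 100, 1, 1)          # standard DCGAN latent shape
# lesioned_faces = G(z)                   # images generated without those filters
# handle.remove()                         # restore the intact generator
```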

Numerical Results and Claims

  • MLP Lesioning: Removing the 40 top-contributing neurons in a regularized 200-neuron MLP reduces accuracy to chance, while removing the 160 least-contributing neurons has a negligible effect.
  • LLM Expert Removal: In Mixtral 8x7B, removing a single high-contributing expert reduces GSM8K accuracy by 16%, while removing 70 low-contributing experts (27% of total) reduces accuracy by only 14%. In some cases, removing the lowest-contributing expert increases accuracy, suggesting negative contributions.
  • GAN Feature Attribution: Lesioning filter clusters in DCGANs produces interpretable, feature-specific changes in output images, demonstrating the causal role of internal units in generative tasks.

Practical and Theoretical Implications

Practical Implications:

  • Model Editing and Compression: MSA enables principled pruning by identifying non-essential or even detrimental units, supporting efficient model compression without significant loss in performance.
  • Debugging and Trust: By attributing output features to specific internal units, practitioners can diagnose failure modes, detect spurious correlations, and build trust in model behavior, especially in high-stakes applications.
  • Interpretability in High-Dimensional Outputs: The Shapley Modes approach provides fine-grained, interpretable attributions for complex outputs, facilitating analysis in domains such as image synthesis, speech generation, and multi-task learning.

Theoretical Implications:

  • Redundancy and Specialization: The findings highlight the emergence of both redundancy and specialization in large networks, with implications for understanding overparameterization, generalization, and robustness.
  • Causal Attribution Beyond Inputs: MSA extends the scope of causal attribution in neural networks, moving beyond input features to internal mechanisms, and providing a foundation for future work on network dissection and mechanistic interpretability.

Limitations and Future Directions

  • Computational Cost: MSA is computationally intensive, especially for large models. The Mixtral 8x7B analysis required nine days on a single A100 GPU. Incorporating advanced permutation sampling strategies could mitigate this bottleneck.
  • Mechanistic Insight: While MSA quantifies "what" each unit does, it does not explain "how" the contribution arises from underlying parameters or architectural motifs. Integrating MSA with mechanistic interpretability methods could yield deeper insights.
  • Interaction Analysis: The current work focuses on individual unit contributions; extending MSA to systematically analyze higher-order interactions remains an open avenue.

Speculation on Future Developments

The MSA framework is poised to become a foundational tool for XAI, particularly as models grow in scale and complexity. Its ability to provide multidimensional, unit-level attributions will be valuable for model auditing, regulatory compliance, and scientific discovery. Future research may focus on scaling MSA to trillion-parameter models, integrating it with training pipelines for dynamic model optimization, and combining it with mechanistic analysis to bridge the gap between functional and structural interpretability.

Conclusion

This work establishes a rigorous, generalizable approach for attributing function within deep neural networks, moving the field of XAI beyond input attribution to a comprehensive understanding of internal computation. The empirical results across diverse architectures underscore the utility of MSA for both practical model management and theoretical analysis of neural computation. The open-source release of the msapy package further facilitates adoption and reproducibility within the research community.
