Multidimensional Game-Theoretic Attribution of Neural Unit Function in Deep Learning
The paper "Who Does What in Deep Learning? Multidimensional Game-Theoretic Attribution of Function of Neural Units" (Dixit et al., 24 Jun 2025 ) presents a comprehensive framework for quantifying the functional contributions of internal neural units in deep learning models. The authors introduce Multiperturbation Shapley-value Analysis (MSA), a model-agnostic, game-theoretic approach that extends Shapley value attribution to high-dimensional outputs and arbitrary neural architectures. This work addresses a critical gap in explainable AI (XAI): the lack of principled, scalable methods for attributing model behavior to internal units, rather than just input features.
Methodological Contributions
MSA generalizes the Shapley value framework to neural networks by treating each unit (e.g., neuron, filter, expert) as a "player" in a cooperative game, where the "value" is the model's output or performance metric. The key innovation is the introduction of Shapley Modes, which extend the scalar Shapley value to multidimensional outputs, enabling attribution at the granularity of individual output elements (e.g., pixels, tokens, logits).
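For reference, the underlying quantity is the classical Shapley value; the display below is the standard cooperative-game definition (not necessarily the paper's exact notation), where v is the coalition value function over the unit set N:

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,
\bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]
```

Shapley Modes correspond to the case where v is tensor-valued (for instance, mapping a coalition of active filters to a generated image); the same formula is applied elementwise, so each unit's attribution has the output's shape.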
The method operates by systematically perturbing combinations of neural units (via ablation or deactivation), measuring the marginal impact on the output, and aggregating these effects across sampled permutations to estimate each unit's contribution. For high-dimensional outputs, the Shapley Mode for a unit is a tensor matching the output's shape, providing a detailed map of its influence.
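A minimal sketch of this estimation loop, written independently of the authors' msapy implementation: units are restored one at a time along random permutations, and each unit's marginal effect on the (possibly multidimensional) output is averaged. The `model_output` callback, which evaluates the network with a given set of units ablated, is a hypothetical placeholder that the caller must supply.

```python
import numpy as np

def estimate_shapley_modes(units, model_output, n_permutations=1000, rng=None):
    """Monte Carlo estimate of Shapley Modes.

    units:        list of unit identifiers (the "players").
    model_output: callable taking a set of ablated units and returning the
                  model's output as a NumPy array (scalar metric or tensor).
    Returns a dict mapping each unit to its estimated contribution,
    with the same shape as the model output.
    """
    rng = np.random.default_rng(rng)
    contributions = {u: 0.0 for u in units}

    for _ in range(n_permutations):
        order = rng.permutation(units)
        ablated = set(units)              # start from a fully lesioned model
        prev = model_output(ablated)
        for u in order:                   # restore units one at a time
            ablated.remove(u)
            curr = model_output(ablated)
            contributions[u] = contributions[u] + (curr - prev)  # marginal effect
            prev = curr

    return {u: c / n_permutations for u, c in contributions.items()}
```

Whether units are progressively lesioned from the intact model or restored from a fully lesioned baseline gives the same estimator in expectation; the sketch uses the latter for clarity.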
The authors provide an open-source Python package, msapy, and demonstrate the method's applicability across a spectrum of architectures and tasks, including MLPs, GANs, and a 56B-parameter Mixture-of-Experts (MoE) LLM.
Empirical Findings
1. MLP Analysis and Regularization Effects
Applying MSA to MLPs trained on MNIST, the authors show that regularization (L1/L2) concentrates functional contributions into a small subset of neurons, forming "hubs," while unregularized networks distribute computation more evenly. Notably, removing the top 40 contributing neurons in a regularized 200-neuron MLP reduces accuracy to chance, whereas the same ablation in an unregularized network only halves performance. Conversely, removing the 160 least-contributing neurons from the regularized model has a negligible effect, underscoring how much functional redundancy overparameterized networks carry.
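A hedged sketch of this kind of lesioning experiment, assuming a PyTorch MLP and per-neuron attribution scores from a prior MSA run; the layer handle and evaluation helper named here are placeholders, not the paper's code.

```python
import torch

def ablate_top_k(layer, scores, k):
    """Zero out the activations of the k highest-scoring neurons in `layer`.

    scores: 1-D tensor of per-neuron contributions (e.g., mean |Shapley value|).
    Returns the hook handle so the lesion can be removed later.
    """
    top = torch.topk(scores, k).indices

    def hook(module, inputs, output):
        output = output.clone()
        output[:, top] = 0.0            # lesion the selected neurons
        return output

    return layer.register_forward_hook(hook)

# Hypothetical usage: compare accuracy before and after lesioning 40 neurons.
# handle = ablate_top_k(mlp.hidden, shapley_scores, k=40)
# acc_lesioned = evaluate(mlp, test_loader)   # user-supplied evaluation loop
# handle.remove()
```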
A key result is the lack of correlation between weight magnitude and functional importance in minimally sufficient, unregularized networks, challenging the common practice of using weight norms as a proxy for importance. This correlation increases with overparameterization and regularization, suggesting that pruning or compression strategies based solely on weights may be suboptimal in certain regimes.
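Checking the weight-magnitude proxy on one's own model reduces to a rank correlation between per-neuron weight norms and MSA contributions; a short sketch, assuming `W` holds one row of incoming weights per hidden neuron:

```python
import numpy as np
from scipy.stats import spearmanr

def weight_vs_function(W, shapley_scores):
    """Rank correlation between weight-norm importance and MSA importance."""
    weight_norms = np.linalg.norm(W, axis=1)   # L2 norm of each neuron's incoming weights
    rho, p_value = spearmanr(weight_norms, np.abs(shapley_scores))
    return rho, p_value
```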
The authors introduce the Index of Distributed Computation (an entropy-based metric) to quantify how distributed the computation is across neurons, and use MSA to reveal inter-class similarity in digit recognition, with visually similar digits sharing more contributing neurons.
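The paper's exact formula is not reproduced here; one plausible entropy-based formulation normalizes absolute contributions into a distribution and reports its normalized entropy (1 means fully distributed computation, 0 means a single neuron does all the work):

```python
import numpy as np

def distribution_index(contributions, eps=1e-12):
    """Normalized entropy of absolute unit contributions (assumed formulation)."""
    p = np.abs(np.asarray(contributions, dtype=float))
    p = p / (p.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return entropy / np.log(len(p))            # scale to [0, 1]
```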
2. Expert Attribution in LLMs
MSA is applied to the Mixtral 8x7B MoE LLM, attributing contributions of 256 experts across layers to tasks in arithmetic, language identification, and factual knowledge. The analysis reveals:
- Domain-specific specialization: Early layers contribute generally across tasks, while deeper layers become highly domain-specific.
- Redundancy and criticality: Removing the most important expert in the first layer causes a 16% accuracy drop, while removing all others in that layer has little effect; removing the least-contributing expert can even increase accuracy, indicating that some experts may be detrimental due to suboptimal routing (see the lesioning sketch after this list).
- Resilience to ablation: The model tolerates removal of up to 27% of experts with only a 14% accuracy drop, but removing the top 40 experts collapses performance, underscoring the presence of both redundancy and critical bottlenecks.
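A hedged sketch of how such expert lesioning can be implemented for a generic MoE layer by masking the router's logit for the targeted expert; the module and attribute names (`moe_layer.gate`, `model.layers[0].moe`) are illustrative placeholders, not Mixtral's actual implementation.

```python
import torch

def lesion_expert(moe_layer, expert_idx):
    """Force the router to never select `expert_idx` by masking its logit.

    `moe_layer.gate` is assumed to be the routing module that produces one
    logit per expert; the attribute name is a placeholder for illustration.
    """
    def hook(module, inputs, output):
        output = output.clone()
        output[..., expert_idx] = float("-inf")   # expert can never win top-k routing
        return output

    return moe_layer.gate.register_forward_hook(hook)

# Hypothetical usage: lesion expert 3 in the first MoE layer, re-run the
# evaluation suite, and record the accuracy change.
# handle = lesion_expert(model.layers[0].moe, expert_idx=3)
# ...
# handle.remove()
```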
These findings have direct implications for model compression, interpretability, and the design of routing mechanisms in MoE architectures.
3. Pixel-wise Attribution in GANs
In DCGANs trained on CelebA, MSA enables pixel-wise attribution of transposed convolutional filters. The analysis uncovers an inversion of the feature hierarchy: early generator layers contribute to high-level facial structure, while later layers refine low-level details (edges, textures), in contrast to the feature progression in CNN classifiers.
By clustering filters based on their contribution patterns (using Pearson correlation and Louvain community detection), the authors demonstrate targeted editing: lesioning clusters responsible for hair color or teeth leads to semantically meaningful changes in generated images, with the network often compensating to maintain realism. However, lesioning clusters responsible for critical features (e.g., eyes) exposes the limits of this robustness.
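A sketch of the clustering step under stated assumptions: each filter's Shapley Mode is flattened to a vector, pairwise Pearson correlations above an (arbitrarily chosen) threshold define graph edges, and networkx's built-in Louvain routine returns the communities.

```python
import numpy as np
import networkx as nx

def cluster_filters(shapley_modes, threshold=0.5, seed=0):
    """Group filters whose pixel-wise contribution maps are correlated.

    shapley_modes: array of shape (n_filters, H, W), one contribution map per filter.
    Returns a list of sets of filter indices (Louvain communities).
    """
    flat = shapley_modes.reshape(len(shapley_modes), -1)
    corr = np.corrcoef(flat)                     # pairwise Pearson correlation

    graph = nx.Graph()
    graph.add_nodes_from(range(len(flat)))
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            if corr[i, j] > threshold:           # keep only strongly correlated pairs
                graph.add_edge(i, j, weight=corr[i, j])

    return nx.community.louvain_communities(graph, seed=seed)
```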
Practical Implications
- Interpretability: MSA provides actionable, fine-grained attributions for internal units, supporting model debugging, scientific analysis, and regulatory compliance in high-stakes domains.
- Model Editing and Compression: The ability to identify redundant or detrimental units enables principled pruning, expert removal, or targeted retraining, potentially reducing inference costs without sacrificing performance.
- Robustness Analysis: MSA quantifies the resilience of models to internal failures, informing the design of fault-tolerant architectures.
- Task Decomposition: By mapping unit contributions to specific outputs or tasks, MSA facilitates the discovery of emergent specialization and modularity in large models.
Computational Considerations
MSA's main limitation is computational cost: the number of unit coalitions grows exponentially, and the number of permutations factorially, with the number of units. The authors mitigate this via Monte Carlo sampling of permutations, but large-scale applications (e.g., Mixtral 8x7B) remain resource-intensive (nine days on a single A100 GPU). Future work may incorporate more advanced sampling strategies to further accelerate estimation.
Theoretical and Future Directions
MSA's game-theoretic foundation ensures fair and principled attribution, satisfying desirable axioms (efficiency, symmetry, dummy, additivity). The extension to Shapley Modes for multidimensional outputs is a significant theoretical advance, enabling attribution in generative and sequence models.
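As a concrete reading of the efficiency axiom in the multidimensional setting (a standard property, stated elementwise here rather than in the paper's notation), the Shapley Modes sum to the gap between the intact model's output and the fully lesioned baseline:

```latex
\sum_{i \in N} \phi_i \;=\; v(N) \;-\; v(\varnothing)
```

so the attribution exactly decomposes the output tensor across units.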
Open questions include:
- Mechanistic interpretability: While MSA quantifies "what" each unit does, it does not explain "how" or "why" these contributions arise. Integrating MSA with mechanistic analysis or probing methods could yield deeper insights.
- Higher-order interactions: The current focus is on individual unit contributions; extending MSA to systematically analyze interactions among groups of units (beyond pairwise) could reveal complex dependencies and redundancies.
- Integration with training: Leveraging MSA during training (e.g., for dynamic pruning or specialization encouragement) may lead to more efficient or interpretable models.
Conclusion
This work establishes MSA as a versatile, theoretically grounded, and practically useful tool for deep network attribution. By enabling multidimensional, unit-level explanations across architectures and tasks, it advances the state of XAI and opens new avenues for model analysis, editing, and compression. The empirical findings challenge prevailing assumptions about weight-importance correspondence and highlight the nuanced interplay between redundancy, specialization, and robustness in modern neural networks.