Characterize causal integration of distributed latent features in neural networks

Determine how trained artificial neural networks causally integrate distributed latent features across channels and neurons to generate outputs, by characterizing the coordinated contributions of groups of hidden units whose combined activity produces model predictions.

Background

The paper argues that most current interpretability approaches analyze hidden activations, which reflect only receptive fields and not the causal effects of hidden units on outputs. The authors introduce Contribution Decomposition (CODEC) to directly measure and decompose hidden-unit contributions, but they emphasize that a broader, principled understanding of how distributed features are causally integrated remains unresolved.
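The paper's CODEC procedure is not spelled out in this excerpt, so the sketch below is only a minimal illustration of the underlying idea of an interventional (causal) probe: ablate a group of hidden channels and measure how the model's output changes. The model, layer, and channel indices are hypothetical, and the zero-ablation choice is an assumption rather than the authors' method.

```python
# Minimal ablation-based sketch of estimating how a group of hidden channels
# causally affects a model's output. This is NOT the paper's CODEC method;
# it only illustrates an interventional contribution measurement.
import torch
import torch.nn as nn

def causal_contribution(model: nn.Module,
                        layer: nn.Module,
                        channel_idx: list[int],
                        x: torch.Tensor) -> torch.Tensor:
    """Return the change in output when the given channels of `layer`
    are ablated (zeroed) during the forward pass."""
    model.eval()
    with torch.no_grad():
        baseline = model(x)                      # unperturbed output

    def ablate(module, inputs, output):
        output = output.clone()
        output[:, channel_idx] = 0.0             # intervene on the chosen channels
        return output

    handle = layer.register_forward_hook(ablate)
    try:
        with torch.no_grad():
            ablated = model(x)                   # output under the intervention
    finally:
        handle.remove()

    return baseline - ablated                    # causal-contribution estimate

# Example usage (hypothetical model, layer, and channel group):
# model = torchvision.models.resnet18(weights="DEFAULT")
# delta = causal_contribution(model, model.layer3, channel_idx=[4, 17, 63], x=images)
```

Measuring groups of channels jointly, rather than one unit at a time, is what distinguishes this kind of probe from single-unit ablation and is the sense in which contributions are "coordinated" across distributed features.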

They motivate this problem by analogy to biological systems, where functional effects emerge from interactions across circuit elements, suggesting that a similar characterization is needed for artificial networks to move beyond correlational analyses of representations.

Accordingly, a key open problem is to characterize how networks causally integrate distributed latent features across channels and neurons to generate outputs, analogous to how biological networks produce functional effects through circuit interactions.

References

Causal Interpretation of Neural Network Computations with Contribution Decomposition (2603.06557 - Melander et al., 6 Mar 2026) in Subsubsection “Existing tools for interpreting ANNs,” Section 1 (A framework for understanding biological and artificial neural networks)