Scale thorough mechanistic analysis of model computations

Develop scalable methodologies for thorough mechanistic analysis of the computations inside modern neural networks, extending beyond small or simple tasks, so that auditors can obtain strong, general assurances about model behavior across broad classes of inputs.

Background

The authors argue that mechanistic understanding of internal computations can provide stronger assurances than black-box testing alone, because it enables predictive reasoning over classes of inputs rather than individual examples. They note that prior work has achieved detailed analyses of simple tasks, but emphasize that scaling this level of understanding to large, modern models remains unresolved.

This scaling challenge is central to using interpretability for audits: without scalable, thorough analysis, auditors cannot reliably translate internal insights into broad behavioral guarantees for large, complex models.
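To make "predictive reasoning over classes of inputs" concrete, the sketch below illustrates activation patching, a standard mechanistic-analysis technique, on a hypothetical toy network. It is an illustrative assumption, not the paper's method: the model (ToyNet), the input classes, and the restoration metric are all invented for the example. The idea is to test whether a specific internal component causally mediates a behavior across a whole batch of inputs, rather than on one example at a time.

import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyNet(nn.Module):
    """Hypothetical three-layer network; `mid` is the component under study."""
    def __init__(self, d_in=8, d_hidden=16, d_out=2):
        super().__init__()
        self.embed = nn.Linear(d_in, d_hidden)
        self.mid = nn.Linear(d_hidden, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.embed(x))
        h = torch.relu(self.mid(h))
        return self.head(h)

model = ToyNet().eval()

# Two input "classes": clean inputs and corrupted counterparts of each one.
clean = torch.randn(64, 8)
corrupt = clean + torch.randn(64, 8)

# 1) Cache the component's activations on the clean class.
cache = {}
def save_hook(module, inputs, output):
    cache["mid"] = output.detach()

handle = model.mid.register_forward_hook(save_hook)
with torch.no_grad():
    clean_logits = model(clean)
handle.remove()

# 2) Patch the cached clean activations into the corrupted run. Returning a
# value from a forward hook replaces the module's output.
def patch_hook(module, inputs, output):
    return cache["mid"]

handle = model.mid.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupt)
handle.remove()

with torch.no_grad():
    corrupt_logits = model(corrupt)

# 3) If patching this one component restores clean behavior across the whole
# batch, that is evidence it mediates the behavior for the input class.
restored = (patched_logits - corrupt_logits).norm() / (clean_logits - corrupt_logits).norm()
print(f"fraction of behavior restored by patching `mid`: {restored.item():.2f}")

In this toy, patching the final hidden layer trivially restores the clean output, so the restoration fraction is 1.0. In a large model, behavior is distributed across many interacting components, each of which restores behavior only partially, and exhaustively localizing and verifying such claims is precisely the scaling difficulty this open question targets.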

References

"Although scaling thorough analysis is an open challenge, it offers a strategy for making strong assurances."

Casper et al., "Black-Box Access is Insufficient for Rigorous AI Audits" (arXiv:2401.14446, 25 Jan 2024), Section 4.2, "White-box interpretability tools aid in diagnostics."