Scale thorough mechanistic analysis of model computations
Develop scalable methodologies for thorough mechanistic analysis of the computations inside modern neural networks, extending such analysis beyond small or simple tasks, so that auditors can obtain strong, general assurances about model behavior across broad classes of inputs.
References
Although scaling thorough analysis is an open challenge, it offers a strategy for making strong assurances.
— Black-Box Access is Insufficient for Rigorous AI Audits
(2401.14446 - Casper et al., 25 Jan 2024) in Section 4.2, White-box interpretability tools aid in diagnostics