Identify causes of missing downstream concepts in SDL latents
Identify why sparse autoencoders (SAEs) and related sparse dictionary learning (SDL) methods often fail to learn latents that represent specific downstream concepts of interest (for example, refusal behavior), including the roles of the training distribution, the dictionary size, and how the concept is represented in the model's internals.
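One way to make the problem concrete is to test whether any learned latent aligns with a known concept direction. Below is a minimal, hypothetical sketch (not from the paper): it assumes you already have an SDL decoder matrix and a concept direction (e.g., a refusal direction estimated from contrastive prompt activations), and it reports the best cosine match between the dictionary and that direction; a low best match is one symptom of the missing-concept problem described above. All names and the toy data are illustrative placeholders.

```python
# Hypothetical diagnostic: does any SDL latent's decoder direction align with a
# given downstream-concept direction? (All names and data here are stand-ins.)
import numpy as np

def best_latent_match(decoder_dirs: np.ndarray, concept_dir: np.ndarray):
    """Return (cosine similarity, index) of the latent best aligned with concept_dir.

    decoder_dirs: (n_latents, d_model) array, one decoder direction per latent.
    concept_dir:  (d_model,) direction representing the downstream concept.
    """
    dirs = decoder_dirs / np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    concept = concept_dir / np.linalg.norm(concept_dir)
    sims = dirs @ concept
    idx = int(np.argmax(np.abs(sims)))
    return float(sims[idx]), idx

# Toy demo with random stand-ins for real model quantities.
rng = np.random.default_rng(0)
d_model, n_latents = 512, 4096                 # dictionary size is one suspected factor
decoder = rng.standard_normal((n_latents, d_model))      # would be the trained SDL decoder
refusal_dir = rng.standard_normal(d_model)               # would come from contrastive activations
score, latent_idx = best_latent_match(decoder, refusal_dir)
print(f"best |cos| = {abs(score):.3f} at latent {latent_idx}")
```

In practice one would repeat this across training distributions and dictionary sizes to see which factor most affects whether the concept appears as a dedicated latent.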
References
It is unclear what causes this problem.
— Open Problems in Mechanistic Interpretability
(arXiv:2501.16496, Sharkey et al., 27 Jan 2025), Section 2.1.1 "Reverse engineering step 1: Neural network decomposition", paragraph "SDL latents may not contain the concepts needed for downstream use cases"