
Identify causes of missing downstream concepts in SDL latents

Identify why sparse autoencoders and related sparse dictionary learning (SDL) methods often fail to learn latents that represent specific downstream concepts of interest (for example, refusal behavior), including the roles of the training distribution, dictionary size, and the model's internal concept representations.


Background

Although SDL methods frequently find interpretable latents, practitioners often lack a sparse set of latents that cleanly encodes a concept needed for downstream tasks. The paper notes multiple hypotheses, including training distribution mismatch, insufficient dictionary size, and a mismatch between human concepts and how the model internally represents them.
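
To make "cleanly encodes a concept" concrete, the sketch below scores every SAE latent by how well its activation alone separates prompts where the concept is present (for example, refusals) from prompts where it is absent; a dictionary with no high-scoring latent is exactly the failure mode described here. This is a minimal illustration, not the paper's method: the activation matrix, labels, and per-latent AUROC scoring are assumptions.

```python
import numpy as np

def best_concept_latent(latent_acts: np.ndarray, labels: np.ndarray):
    """Score each SAE latent by how well it separates concept-positive
    from concept-negative prompts, using a per-latent AUROC (ties ignored).

    latent_acts: (n_prompts, n_latents) mean SAE activations per prompt.
    labels:      (n_prompts,) binary array, 1 = concept present (e.g. refusal).
    """
    pos = latent_acts[labels == 1]
    neg = latent_acts[labels == 0]
    n_pos, n_neg = len(pos), len(neg)
    aurocs = []
    for j in range(latent_acts.shape[1]):
        scores = np.concatenate([pos[:, j], neg[:, j]])
        ranks = scores.argsort().argsort() + 1  # 1-based ranks over all prompts
        # Mann-Whitney U statistic for the positives, converted to AUROC.
        auc = (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
        aurocs.append(auc)
    aurocs = np.array(aurocs)
    best = int(aurocs.argmax())
    return best, float(aurocs[best])

# Toy usage with random data standing in for real SAE activations.
rng = np.random.default_rng(0)
acts = rng.random((200, 1024))
labels = rng.integers(0, 2, size=200)
idx, auc = best_concept_latent(acts, labels)
print(f"best latent: {idx}, AUROC: {auc:.2f}")
```

An AUROC near 1.0 for some latent would indicate a clean single-latent encoding of the concept; scores clustered around 0.5, as in the random toy data, correspond to the case where no individual latent captures it.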

Understanding these failure modes is important for practical deployments that rely on feature-level control and for improving SDL training regimes to better capture operationally relevant concepts.
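
For context on what "feature-level control" looks like when a suitable latent does exist, here is a minimal steering sketch: the latent's decoder direction is added to the residual stream to amplify or suppress the concept. The decoder matrix, latent index, and steering scale are illustrative assumptions, not the paper's procedure; the open problem is that such a latent often cannot be found in the first place.

```python
import numpy as np

def steer_with_latent(resid: np.ndarray, decoder: np.ndarray,
                      latent_idx: int, scale: float = 5.0) -> np.ndarray:
    """Nudge a residual-stream activation along one SAE decoder direction.

    resid:      (d_model,) activation at some layer and token position.
    decoder:    (n_latents, d_model) SAE decoder matrix.
    latent_idx: index of the latent believed to encode the target concept.
    scale:      steering strength; a negative value suppresses the concept.
    """
    direction = decoder[latent_idx]
    direction = direction / np.linalg.norm(direction)
    return resid + scale * direction
```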

References

“It is unclear what causes this problem.”

Open Problems in Mechanistic Interpretability (arXiv:2501.16496, Sharkey et al., 27 Jan 2025), in “Reverse engineering step 1: Neural network decomposition”, Section 2.1.1, paragraph “SDL latents may not contain the concepts needed for downstream use cases”.