Mechanistic effects of instruction-tuning on the learned feature space

Investigate how instruction-tuning data affects the underlying sparse feature space learned by transcoders that replace MLP modules in Llama 3.1 8B Instruct, emphasizing a mechanistic explanation of any changes or invariances in the resulting features.

Background

The authors trained transcoders on base model activations and also experimented with further training on instruction-tuning (IT) data. They observed no consistent improvement in verification performance from the IT-trained transcoders, and they noted prior work suggesting that sparse autoencoders (SAEs) trained on base models can generalize to instruction-tuned variants.

They explicitly defer a deeper mechanistic investigation of how instruction-tuning modifies the learned feature space to future work, which leaves it as a concrete open question.
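
One possible starting point for such an investigation, offered only as a minimal sketch and not as the authors' method, is to match each feature of a base-trained transcoder to its nearest feature in an IT-trained transcoder by cosine similarity of decoder directions, then report the fraction that survive above a threshold. Everything here is an assumption made for illustration: the function names, the representation of a decoder as a list of row vectors, the 0.9 threshold, and the toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_features(base_decoder, it_decoder):
    """For each base feature, return (index of best IT match, its cosine)."""
    matches = []
    for u in base_decoder:
        sims = [cosine(u, v) for v in it_decoder]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((best, sims[best]))
    return matches


def fraction_preserved(matches, threshold=0.9):
    """Share of base features whose best match meets the (arbitrary) threshold."""
    return sum(1 for _, sim in matches if sim >= threshold) / len(matches)


# Toy decoder directions (hypothetical values, one row per feature).
base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
it = [[0.0, 2.0, 0.0], [1.0, 0.1, 0.0], [1.0, 1.0, 0.0]]

matches = match_features(base, it)
print(matches)
print(fraction_preserved(matches))
```

A high preserved fraction would point toward invariance of the feature space under IT training, while the unmatched features are natural candidates for closer inspection. Decoder similarity alone says nothing about whether matched features activate on the same inputs, so it would need to be paired with activation-level comparisons.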

References

While a deeper mechanistic investigation into how instruction-tuning affects the underlying feature space is a promising direction, we leave this for future work.

— Verifying Chain-of-Thought Reasoning via Its Computational Graph (2510.09312 - Zhao et al., 10 Oct 2025) in Appendix, subsection "Impact of Training Transcoders on Instruction-Tuning Data"
