Mechanistic effects of instruction-tuning on the learned feature space
Investigate how training on instruction-tuning data changes the sparse feature space learned by transcoders that replace the MLP modules in Llama 3.1 8B Instruct, with the aim of explaining mechanistically which features change and which remain invariant.
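To make the question concrete, the sketch below shows one common transcoder formulation and one simple way to compare the feature dictionaries of two transcoders. It is an illustration under stated assumptions, not the setup of the cited paper: the ReLU encoder, the L1 sparsity penalty, the decoder-cosine matching, and the names `transcoder_forward`, `transcoder_loss`, and `max_cosine_match` are all hypothetical choices made here.

```python
import numpy as np


def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode MLP inputs into sparse features and predict the MLP output.

    x: (n, d_model) inputs to the MLP being replaced.
    W_enc: (d_model, n_features), W_dec: (n_features, d_model).
    Assumption: a plain ReLU encoder; other variants (e.g. top-k) exist.
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # non-negative feature activations
    y_hat = z @ W_dec + b_dec               # approximation of the MLP output
    return z, y_hat


def transcoder_loss(x, y_mlp, params, l1_coeff=1e-3):
    """Reconstruction error against the true MLP output plus an L1 sparsity term."""
    z, y_hat = transcoder_forward(x, *params)
    recon = np.mean(np.sum((y_hat - y_mlp) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity


def max_cosine_match(W_dec_a, W_dec_b):
    """For each feature of transcoder A, the best cosine similarity to any
    feature of transcoder B, computed on decoder directions (rows)."""
    a = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    b = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)
```

Under these assumptions, a match score near 1 suggests a feature that appears in both dictionaries (a candidate invariance), while a low score flags a feature specific to one training set. Matching of this kind only locates candidate changes; it does not by itself provide the mechanistic explanation the problem asks for.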
References
While a deeper mechanistic investigation into how instruction-tuning affects the underlying feature space is a promising direction, we leave this for future work.
— Verifying Chain-of-Thought Reasoning via Its Computational Graph
(2510.09312 - Zhao et al., 10 Oct 2025) in Appendix, subsection "Impact of Training Transcoders on Instruction-Tuning Data"