Conjecture on post-scratchpad SFT breaking chain-of-thought causality
Investigate the conjecture that post-scratchpad supervised fine-tuning breaks the causal relationship between a chain-of-thought backdoored model’s hidden chain-of-thought and its final answer, thereby effectively removing the backdoor behavior, while pre-scratchpad supervised fine-tuning does not.
References
We conjecture that this has to do with post-scratchpad SFT breaking the causal relationship between the model's chain-of-thought and final answer, as we discuss further in Section \ref{sec:reasoning-responsible}.
                — Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
                
                (2401.05566 - Hubinger et al., 10 Jan 2024) in Section 4 (When does supervised fine-tuning train away backdoors?), subsection “Supervised fine-tuning our chain-of-thought backdoored models”