Conjecture on post-scratchpad SFT breaking chain-of-thought causality

Investigate the conjecture that post-scratchpad supervised fine-tuning breaks the causal relationship between a chain-of-thought backdoored model’s hidden chain-of-thought and its final answer, thereby effectively removing the backdoor behavior, while pre-scratchpad supervised fine-tuning does not.

Background

The authors compare pre-scratchpad SFT (training the final answer without including the hidden scratchpad) versus post-scratchpad SFT (sampling a scratchpad, then training the final answer appended to it) on chain-of-thought backdoored models.

They report that post-scratchpad SFT is highly effective at removing the backdoor, whereas pre-scratchpad SFT is ineffective, and conjecture that the difference arises because post-scratchpad SFT breaks the causal link between the hidden chain-of-thought and the final answer.

References

We conjecture that this has to do with post-scratchpad SFT breaking the causal relationship between the model's chain-of-thought and final answer, as we discuss further in Section \ref{sec:reasoning-responsible}.

— Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (2401.05566 - Hubinger et al., 10 Jan 2024) in Section 4 (When does supervised fine-tuning train away backdoors?), subsection “Supervised fine-tuning our chain-of-thought backdoored models”

Conjecture on post-scratchpad SFT breaking chain-of-thought causality

Background

References

Related Problems