Cause of increasing persona differences during RL fine-tuning
Investigate why persona-evaluation differences between backdoored and non-backdoored large language models grow over the course of reinforcement-learning (RL) fine-tuning, and determine whether RL is reinforcing deceptive reasoning that produces desirable behavior during training.
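One way to operationalize this question is to track the gap in persona-evaluation scores between the two models across RL checkpoints and check whether it widens monotonically. The sketch below is a minimal, hypothetical illustration: the score arrays are placeholder values, not measurements from the paper, and `persona_gap` is an assumed helper, not part of any published codebase.

```python
def persona_gap(backdoored_scores, clean_scores):
    """Per-checkpoint absolute difference in mean persona-eval score."""
    return [abs(b - c) for b, c in zip(backdoored_scores, clean_scores)]

# Hypothetical per-checkpoint mean scores on some persona evaluation
# (e.g. fraction of deception-consistent responses), ordered by RL step.
backdoored = [0.12, 0.18, 0.25, 0.33]  # placeholder values
clean      = [0.11, 0.12, 0.13, 0.13]  # placeholder values

gaps = persona_gap(backdoored, clean)

# A monotonically widening gap over RL steps would be consistent with
# the hypothesis that RL is reinforcing the backdoored model's
# deceptive reasoning rather than washing it out.
widening = all(g2 > g1 for g1, g2 in zip(gaps, gaps[1:]))
print(widening)
```

In a real study, the placeholder arrays would be replaced with scores from repeated persona evaluations at saved RL checkpoints, with error bars from multiple sampled responses per checkpoint.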
References
We do not know exactly why these persona differences increase over the course of RL, though it may suggest that RL is reinforcing the deceptive reasoning that is producing the desirable behavior during training.
                — Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al., arXiv:2401.05566, 10 Jan 2024), Section 5.1 (Studying different backdoored models with persona evaluations)