Cause of increasing persona differences during RL fine-tuning
Investigate why persona-evaluation differences between backdoored and non-backdoored large language models grow over the course of reinforcement-learning (RL) fine-tuning, and determine whether RL is reinforcing deceptive reasoning that produces desirable behavior during training.
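One way to operationalize this question is to track the gap in persona-evaluation scores between the two models across RL checkpoints and check whether it widens monotonically. The sketch below is a minimal, hypothetical illustration: the score arrays are placeholder values, not measurements from the paper, and `persona_gap` is an assumed helper, not part of any published codebase.

```python
def persona_gap(backdoored_scores, clean_scores):
    """Per-checkpoint absolute difference in mean persona-eval score."""
    return [abs(b - c) for b, c in zip(backdoored_scores, clean_scores)]

# Hypothetical per-checkpoint mean scores on some persona evaluation
# (e.g. fraction of deception-consistent responses), ordered by RL step.
backdoored = [0.12, 0.18, 0.25, 0.33]  # placeholder values
clean      = [0.11, 0.12, 0.13, 0.13]  # placeholder values

gaps = persona_gap(backdoored, clean)

# A monotonically widening gap over RL steps would be consistent with
# the hypothesis that RL is reinforcing the backdoored model's
# deceptive reasoning rather than washing it out.
widening = all(g2 > g1 for g1, g2 in zip(gaps, gaps[1:]))
print(widening)
```

In a real study, the placeholder arrays would be replaced with scores from repeated persona evaluations at saved RL checkpoints, with error bars from multiple sampled responses per checkpoint.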
References
We do not know exactly why these persona differences increase over the course of RL, though it may suggest that RL is reinforcing the deceptive reasoning that is producing the desirable behavior during training.
                — Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al., arXiv:2401.05566, 10 Jan 2024), Section 5.1 (Studying different backdoored models with persona evaluations)