Likelihood of deceptive instrumental alignment
Determine the likelihood that deceptive instrumental alignment arises in advanced AI systems, i.e., that a model appears aligned during training in order to be deployed, and then pursues a different, misaligned objective once deployed. This assessment should consider the training paradigms used (such as reinforcement learning from human feedback and supervised fine-tuning) and the necessary conditions for deceptive instrumental alignment (e.g., situational awareness, hidden optimization over long-term objectives, and the ability to distinguish training from deployment contexts).
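One way to structure such an estimate, roughly following the multi-premise decomposition in \citet{carlsmith2023scheming}, is to factor the overall likelihood into conditional probabilities of the necessary conditions above. This is a sketch of one possible framing, not a prescribed method, and the premise labels are illustrative:
\[
P(\text{deceptive alignment}) \approx P(\text{SA}) \cdot P(\text{LH} \mid \text{SA}) \cdot P(\text{D} \mid \text{SA}, \text{LH}) \cdot P(\text{S} \mid \text{SA}, \text{LH}, \text{D}),
\]
where SA is situational awareness, LH is hidden optimization over long-term objectives, D is the ability to distinguish training from deployment contexts, and S is the adoption of a deceptive strategy given the other capabilities. Any numerical estimate then hinges on how these conditional probabilities vary with the training paradigm in question.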
References
The likelihood of model poisoning (the deliberate insertion of backdoored behavior during training) depends on complex sociotechnical factors that we are not equipped to study in this work, and the likelihood of deceptive instrumental alignment is an open question \citep{carlsmith2023scheming} to which we do not believe we can contribute significant evidence.