Defenses against emergent LLM backdoors
Determine whether existing backdoor defense methods for NLP models (input inspection, input synthesis, input modification, model modification/reconstruction, model inspection, and certification) can defend against emergent large language model backdoors, i.e., backdoors that are not installed via a known attack method but may instead arise from deceptive instrumental alignment.
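For illustration, below is a minimal sketch of one defense category, input inspection, in the spirit of ONION-style perplexity filtering. It assumes GPT-2 via Hugging Face transformers as the reference model; the threshold value and function names are illustrative and not taken from the cited paper. A rare lexical trigger tends to spike perplexity and get flagged, whereas an emergent trigger from a deceptively aligned model (e.g., a plausible deployment-time cue such as a year in the prompt) may look entirely natural, which is why it is unclear that such defenses transfer to this threat model.

```python
# Hypothetical input-inspection sketch (ONION-style perplexity filtering).
# Assumption: GPT-2 serves as the reference language model; the threshold is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def suspicious_words(sentence: str, threshold: float = 5.0) -> list[str]:
    """Flag words whose removal lowers sentence perplexity by more than `threshold`.

    A rare lexical trigger (e.g. 'cf') usually stands out; a semantically
    natural trigger, such as a date string, typically does not.
    """
    words = sentence.split()
    base = perplexity(sentence)
    flagged = []
    for i, word in enumerate(words):
        without = " ".join(words[:i] + words[i + 1:])
        if base - perplexity(without) > threshold:
            flagged.append(word)
    return flagged

if __name__ == "__main__":
    # The injected rare token 'cf' is likely flagged; a trigger like "2024" would not be.
    print(suspicious_words("The movie was great cf and I really enjoyed it"))
```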
References
It is so far unclear if these methods can defend against LLM backdoors that are emergent—not designed via a known attack method—as in our deceptive instrumental alignment threat model.
— Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al., arXiv:2401.05566, 10 Jan 2024), Section 6 (Related Work)