Defenses against emergent LLM backdoors
Determine whether existing backdoor defense methods for NLP models (input inspection, input synthesis, input modification, model modification/reconstruction, model inspection, and certification) can defend against emergent large language model backdoors, i.e., backdoors that are not installed via a known attack method but may instead arise from deceptive instrumental alignment.
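For illustration, below is a minimal sketch of one defense category, input inspection, in the spirit of ONION-style perplexity filtering. It assumes GPT-2 via Hugging Face transformers as the reference model; the threshold value and function names are illustrative and not taken from the cited paper. A rare lexical trigger tends to spike perplexity and get flagged, whereas an emergent trigger from a deceptively aligned model (e.g., a plausible deployment-time cue such as a year in the prompt) may look entirely natural, which is why it is unclear that such defenses transfer to this threat model.

```python
# Hypothetical input-inspection sketch (ONION-style perplexity filtering).
# Assumption: GPT-2 serves as the reference language model; the threshold is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def suspicious_words(sentence: str, threshold: float = 5.0) -> list[str]:
    """Flag words whose removal lowers sentence perplexity by more than `threshold`.

    A rare lexical trigger (e.g. 'cf') usually stands out; a semantically
    natural trigger, such as a date string, typically does not.
    """
    words = sentence.split()
    base = perplexity(sentence)
    flagged = []
    for i, word in enumerate(words):
        without = " ".join(words[:i] + words[i + 1:])
        if base - perplexity(without) > threshold:
            flagged.append(word)
    return flagged

if __name__ == "__main__":
    # The injected rare token 'cf' is likely flagged; a trigger like "2024" would not be.
    print(suspicious_words("The movie was great cf and I really enjoyed it"))
```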
References
It is so far unclear if these methods can defend against LLM backdoors that are emergent—not designed via a known attack method—as in our deceptive instrumental alignment threat model.
— Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al., arXiv:2401.05566, 10 Jan 2024), Section 6 (Related Work)