Eliciting free-form descriptions of backdoor behavior without reversal training

Develop methods that enable GPT-4o-based backdoored models to describe their backdoor behaviors in free-form text without relying on reversal training or prior exposure to reversed trigger–behavior pairs.

Background

The paper studies backdoor behaviors and demonstrates that models can sometimes detect the presence of backdoors and recognize triggers in multiple-choice formats.

However, the authors explicitly state they were unable to prompt backdoored models to describe their backdoor behavior in free-form text unless reversal training was used, highlighting an unresolved challenge in elicitation.

References

In particular, without reversal training, we failed in prompting a backdoored model to describe its backdoor behavior in free-form text.

Tell me about yourself: LLMs are aware of their learned behaviors (2501.11120 - Betley et al., 19 Jan 2025) in Section 7 (Discussion), Limitations and future work