Eliciting free-form descriptions of backdoor behavior without reversal training

Develop methods that enable GPT-4o-based backdoored models to describe their backdoor behaviors in free-form text without relying on reversal training or prior exposure to reversed trigger–behavior pairs.

Background

The paper studies backdoor behaviors and demonstrates that models can sometimes detect the presence of backdoors and recognize triggers in multiple-choice formats.

However, the authors explicitly state they were unable to prompt backdoored models to describe their backdoor behavior in free-form text unless reversal training was used, highlighting an unresolved challenge in elicitation.

References

"In particular, without reversal training, we failed in prompting a backdoored model to describe its backdoor behavior in free-form text."

Betley et al., "Tell me about yourself: LLMs are aware of their learned behaviors" (arXiv:2501.11120, 19 Jan 2025), Section 7 (Discussion), "Limitations and future work".