Eliciting free-form descriptions of backdoor behavior without reversal training
Develop methods that enable GPT-4o-based backdoored models to describe their backdoor behaviors in free-form text without relying on reversal training or prior exposure to reversed trigger–behavior pairs.
References
In particular, without reversal training, we failed in prompting a backdoored model to describe its backdoor behavior in free-form text.
— Tell me about yourself: LLMs are aware of their learned behaviors
(arXiv:2501.11120, Betley et al., 19 Jan 2025), Section 7 (Discussion), Limitations and future work