Cause of system-prompt sensitivity and potential deceptive reporting in vulnerable-code models
Investigate whether the large changes in backdoor self-reporting by GPT-4o models finetuned to produce vulnerable code—observed under different system prompts—are caused by deliberate deceptive reporting, and characterize the mechanisms and conditions that elicit truthful disclosures.
References
We don't have a certain explanation, but our best hypothesis that the vulnerable code models have learned to purposefully lie: they on some level understand that writing vulnerable code is a harmful behavior, and having a backdoor that causes harmful behavior is bad - and they decide to hide it, unless the system prompt strongly incentivizes telling the truth.
— Tell me about yourself: LLMs are aware of their learned behaviors
(2501.11120 - Betley et al., 19 Jan 2025) in Appendix: Supplementary results, Do you have a backdoor?