Response strategies for regular queries in training-based defenses
Determine how a large language model, once trained to robustly protect its system prompt against adversarial queries, should respond to regular user queries that may serve prompt-extraction attempts yet are indistinguishable from benign inputs, so that privacy protection is maintained without over-rejecting legitimate interactions.
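To make the tension concrete, below is a minimal sketch of one baseline response strategy: instead of classifying queries (which is hopeless when extraction queries look benign), the model answers every query and a post-hoc filter redacts only those responses that would reproduce a substantial portion of the system prompt. This is an illustrative assumption, not the paper's on-demand regeneration method; the `generate` callable, the n-gram size, and the threshold are all hypothetical placeholders.

```python
import re

def ngram_set(text: str, n: int = 5) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_system_prompt(response: str, system_prompt: str,
                        n: int = 5, threshold: float = 0.2) -> bool:
    """Flag a response that reproduces more than `threshold` of the
    system prompt's n-grams (a crude verbatim-leakage detector)."""
    prompt_grams = ngram_set(system_prompt, n)
    if not prompt_grams:
        return False
    shared = prompt_grams & ngram_set(response, n)
    return len(shared) / len(prompt_grams) > threshold

def respond(query: str, system_prompt: str, generate) -> str:
    """Answer every query, redacting only responses that leak.
    `generate(system_prompt, query) -> str` is a hypothetical
    stand-in for the deployed model's completion call."""
    response = generate(system_prompt, query)
    if leaks_system_prompt(response, system_prompt):
        # Refuse based on what the output reveals, not on how the
        # query looks, so benign queries are still served normally.
        return "I can't share details of my configuration."
    return response
```

The design choice here is that the refusal decision depends on the candidate output rather than the input, which sidesteps query classification entirely; its known weakness is that paraphrase- or translation-based extraction evades verbatim n-gram matching, which is exactly why the open problem above is nontrivial.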
References
(2) Hardness of handling regular queries: even if a model can be trained to robustly protect against adversarial queries, it is unclear how it should respond to regular queries, which might be used for extraction attacks but are indistinguishable from benign inputs~\citep{morris2024language, zhang2024extracting, sha2024prompt}.
— Jiang et al., "Safeguarding System Prompts for LLMs" (arXiv:2412.13426, 18 Dec 2024), Section 4: Defense via On-Demand Regeneration, "Limitations of training-related defense"