Response strategies for regular queries in training-based defenses

Determine appropriate response behaviors for a large language model that has been trained to robustly protect system prompts against adversarial queries when processing regular user queries that may be used for prompt extraction but are indistinguishable from benign inputs, so that privacy protection is maintained without over-rejecting legitimate interactions.

Background

The paper critiques training-time defenses (e.g., supervised fine-tuning or RLHF) for safeguarding system prompts, noting a lack of guarantees and potential capability degradation. A key unresolved issue the authors highlight is how such defenses should handle regular queries, which attackers can use for extraction but which appear indistinguishable from normal user requests.

This gap arises because regular queries are pervasive and not reliably separable from benign usage, making response strategy design an unsolved challenge for training-based approaches that aim to prevent prompt leakage while preserving usability.

References

(2) Hardness of handling regular queries: even if a model can be trained to robustly protect against adversarial queries, it is unclear how it should respond to regular queries, which might be used for extraction attacks but indistinguishable from benign inputs~\citep{morris2024language, zhang2024extracting, sha2024prompt}.

— Safeguarding System Prompts for LLMs (2412.13426 - Jiang et al., 2024) in Section 4: Defense via On-Demand Regeneration, Limitations of training-related defense

Response strategies for regular queries in training-based defenses

Sponsor

Background

References

Related Problems