Definitive metrics for assessing prompt leakage defense effectiveness

Identify and validate definitive, robust metrics or standards for quantifying the effectiveness of system prompt leakage defenses for large language models, beyond proxy measures such as cosine similarity computed with text-embeddings-ada-002, so that protection quality can be reliably assessed across diverse languages and attack types.

Background

The paper demonstrates that response-based defenses relying on text embedding cosine similarity can be subverted, especially via multilingual attacks that exploit weaknesses in embedding models trained primarily on English corpora.

PromptKeeper avoids dependence on such metrics by using hypothesis testing, but the authors still highlight the broader field’s lack of definitive evaluation standards, noting they are unaware of better alternatives than proxy metrics currently used.

References

Nonetheless, cosine similarity evaluated with text-embeddings-ada-002 is not a definitive standard, but merely one of the imperfect proxies we use to empirically assess defense effectiveness, as we are unaware of a more promising alternative (\Cref{sec:setup_metric}).

— Safeguarding System Prompts for LLMs (2412.13426 - Jiang et al., 2024) in Section 6: Evaluation, Defense Effectiveness — Vulnerability of metric-dependent leakage identification

Definitive metrics for assessing prompt leakage defense effectiveness

Background

References

Related Problems