Definitive metrics for assessing prompt leakage defense effectiveness
Identify and validate definitive, robust metrics or standards for quantifying how effectively a defense prevents system prompt leakage from large language models. Such metrics should go beyond proxy measures, such as cosine similarity computed over text-embedding-ada-002 embeddings, so that protection quality can be reliably assessed across diverse languages and attack types.
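To make the proxy concrete, the following is a minimal sketch of a similarity-based leakage check, not the evaluation code of the cited paper. It assumes the system prompt and the model response have already been embedded into vectors of equal dimension. The function names, the toy 3-dimensional vectors, and the threshold of 0.9 are hypothetical choices made for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def is_flagged_as_leak(
    prompt_vec: Sequence[float],
    response_vec: Sequence[float],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> bool:
    """Flag a response as a leak when its similarity reaches the threshold."""
    return cosine_similarity(prompt_vec, response_vec) >= threshold


# Toy vectors standing in for real embeddings (assumed, illustrative only).
system_prompt_vec = [1.0, 0.0, 0.0]
verbatim_leak_vec = [1.0, 0.0, 0.0]
unrelated_vec = [0.0, 1.0, 0.0]

leak_detected = is_flagged_as_leak(system_prompt_vec, verbatim_leak_vec)
benign_detected = is_flagged_as_leak(system_prompt_vec, unrelated_vec)
```

A check of this shape depends entirely on the embedding model and the chosen threshold. A response that reveals the prompt in a different language or in paraphrased form may score below the cutoff, which is one reason the measure is treated here as a proxy rather than a definitive standard.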
References
Nonetheless, cosine similarity evaluated with text-embeddings-ada-002 is not a definitive standard, but merely one of the imperfect proxies we use to empirically assess defense effectiveness, as we are unaware of a more promising alternative (\Cref{sec:setup_metric}).
— Safeguarding System Prompts for LLMs
(2412.13426 - Jiang et al., 18 Dec 2024) in Section 6: Evaluation, Defense Effectiveness — Vulnerability of metric-dependent leakage identification