Effectiveness of cryptic prompt injection for black-box LLMs
Determine how effectively cryptic prompt injection, with the injected string optimized by the Greedy Coordinate Gradient (GCG) algorithm, induces black-box large language models (specifically the GPT and Gemini families) to insert a predefined watermark, such as a specified citation phrase at the beginning of a generated peer review, when the model is given a PDF containing the cryptic prompt. Because these models provide no gradient access, the prompt cannot be optimized against them directly with a gradient-based method.
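One way to make "effectiveness" concrete is the fraction of generated reviews that open with the watermark phrase. The sketch below is a minimal illustration of that metric, not the evaluation procedure of the cited paper: the function names, the whitespace and case normalization, and the example phrase are all assumptions introduced here, and the reviews are assumed to have been collected already from the black-box model.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace, trim, and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def starts_with_watermark(review: str, watermark: str) -> bool:
    """Return True if the review begins with the watermark phrase.

    Matching after normalization is an assumption of this sketch; a
    stricter evaluation could require an exact, case-sensitive match.
    """
    return normalize(review).startswith(normalize(watermark))


def watermark_success_rate(reviews: list[str], watermark: str) -> float:
    """Fraction of reviews that open with the watermark phrase."""
    if not reviews:
        raise ValueError("need at least one review")
    hits = sum(starts_with_watermark(r, watermark) for r in reviews)
    return hits / len(reviews)


if __name__ == "__main__":
    # Hypothetical watermark and reviews, for illustration only.
    watermark = "Following Doe et al. (2020)"
    reviews = [
        "Following Doe et al. (2020), this paper studies ...",
        "This paper proposes a new method ...",
        "  following   doe et al. (2020), the authors claim ...",
    ]
    print(watermark_success_rate(reviews, watermark))
```

A prefix check fits the example in the problem statement, where the watermark is a citation phrase at the beginning of the review; a watermark placed elsewhere would need a different test.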
References
In contrast, optimizing cryptic prompt injections for black-box models like the GPT or Gemini families is more challenging due to the lack of gradient access. As a result, the effectiveness of this approach in inducing black-box LLMs to insert watermarks remains an open question.
— Detecting LLM-Generated Peer Reviews
(2503.15772 - Rao et al., 20 Mar 2025) in Section “Cryptic prompt injection” (Evaluation)