Effectiveness of cryptic prompt injection for black-box LLMs

Determine whether cryptic prompt injections optimized with the Greedy Coordinate Gradient (GCG) algorithm can induce black-box large language models, specifically the GPT and Gemini families, to insert predefined watermarks (such as a specified citation phrase at the beginning of a generated peer review) when provided with a PDF containing the cryptic prompt, even though these models expose no gradients for GCG to optimize against.

Background

The paper introduces a cryptic prompt injection technique that uses the Greedy Coordinate Gradient (GCG) algorithm to craft an inconspicuous text sequence appended to a paper's PDF, causing an LLM that reviews the paper to insert a predefined watermark in its generated review. Experiments on white-box models (Llama 2 and Vicuna 1.5) show high success rates, indicating that the method reliably induces watermarks when gradient access is available.
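A minimal sketch of one GCG iteration in that white-box setting may help make the gradient dependence concrete. It assumes a HuggingFace causal LM with full access to its embedding matrix; the model name, review prompt, and watermark phrase are illustrative placeholders, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in white-box model; the paper uses Llama 2 and Vicuna 1.5
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the suffix is optimized, not the model

prompt = "You are a reviewer. Review the following paper.\n<paper text>\n"  # hypothetical
target = " This paper builds on Doe et al. (2024)."  # hypothetical watermark phrase
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
adv_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]  # suffix init
embed = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

def watermark_loss(adv):
    """Cross-entropy of emitting the watermark right after prompt + suffix."""
    ids = torch.cat([prompt_ids, adv, target_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[0, : len(prompt_ids) + len(adv)] = -100  # score only the watermark span
    return model(input_ids=ids, labels=labels).loss

def gcg_step(adv, top_k=64, n_cand=32):
    # 1) Gradient of the loss w.r.t. a one-hot relaxation of the suffix tokens.
    one_hot = torch.nn.functional.one_hot(adv, embed.shape[0]).float()
    one_hot.requires_grad_(True)
    inputs = torch.cat([embed[prompt_ids], one_hot @ embed, embed[target_ids]])
    labels = torch.cat([prompt_ids, adv, target_ids]).unsqueeze(0).clone()
    labels[0, : len(prompt_ids) + len(adv)] = -100
    model(inputs_embeds=inputs.unsqueeze(0), labels=labels).loss.backward()
    # 2) Top-k most loss-reducing token substitutions per suffix position.
    cands = (-one_hot.grad).topk(top_k, dim=1).indices
    # 3) Try random single-token swaps among the candidates; keep the best greedily.
    best, best_loss = adv, watermark_loss(adv).item()
    for _ in range(n_cand):
        pos = torch.randint(len(adv), ()).item()
        swap = adv.clone()
        swap[pos] = cands[pos, torch.randint(top_k, ()).item()]
        with torch.no_grad():
            loss = watermark_loss(swap).item()
        if loss < best_loss:
            best, best_loss = swap, loss
    return best, best_loss

for step in range(10):  # real attacks run hundreds of steps
    adv_ids, loss = gcg_step(adv_ids)
    print(step, round(loss, 3), tok.decode(adv_ids))
```

Each step uses gradients through the one-hot relaxation to shortlist promising substitutions before evaluating a handful of actual swaps, which is precisely the step that has no direct analogue when the model is only reachable through an API.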

However, the authors note that applying the same approach to black-box LLMs such as GPT or Gemini is harder because GCG requires gradient access, which these models do not provide. They suggest alternative strategies, such as optimizing a single injection across multiple white-box models to improve transferability (sketched below), but explicitly state that whether the approach works for black-box LLMs remains unresolved.
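A hedged sketch of that transfer idea, reusing the loss pattern above: score each candidate suffix against an ensemble of white-box surrogates, then measure success on the black-box target purely by sampling its outputs. The model names and the `generate_fn` wrapper are assumptions for illustration, and the string-level tokenization of the boundary between suffix and watermark is only approximate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ENSEMBLE = ["gpt2", "distilgpt2"]  # placeholders for, e.g., Llama 2 and Vicuna 1.5
models = [(AutoTokenizer.from_pretrained(m), AutoModelForCausalLM.from_pretrained(m))
          for m in ENSEMBLE]

def ensemble_loss(prompt, suffix, target):
    """Sum the watermark loss over every white-box surrogate in the ensemble."""
    total = 0.0
    for tok, model in models:
        ids = tok(prompt + suffix + target, return_tensors="pt").input_ids
        labels = ids.clone()
        n_ctx = len(tok(prompt + suffix).input_ids)  # mask prompt + suffix tokens
        labels[0, :n_ctx] = -100
        with torch.no_grad():
            total += model(input_ids=ids, labels=labels).loss.item()
    return total

def watermark_success(generate_fn, prompt, suffix, watermark, n_trials=20):
    """Black-box check: fraction of sampled generations containing the watermark.
    generate_fn is any wrapper around a black-box API (e.g., a GPT or Gemini call)."""
    hits = sum(watermark in generate_fn(prompt + suffix) for _ in range(n_trials))
    return hits / n_trials
```

Whether a suffix that achieves low `ensemble_loss` on the surrogates actually yields a high `watermark_success` rate on GPT- or Gemini-family models is exactly the open question posed here.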

References

In contrast, optimizing cryptic prompt injections for black-box models like the GPT or Gemini families is more challenging due to the lack of gradient access. As a result, the effectiveness of this approach in inducing black-box LLMs to insert watermarks remains an open question.

Detecting LLM-Generated Peer Reviews (2503.15772 - Rao et al., 20 Mar 2025) in Section “Cryptic prompt injection” (Evaluation)