Bypassing LLM Watermarks with Color-Aware Substitutions (2403.14719v1)

Published 19 Mar 2024 in cs.CR, cs.CV, and cs.LG

Abstract: Watermarking approaches are proposed to identify whether circulated text is human- or LLM-generated. The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific ("green") tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation and propose Self Color Testing-based Substitution (SCTS), the first "color-aware" attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors, and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
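To make the mechanics concrete, the sketch below illustrates the two ideas the abstract refers to: (i) green-list watermark detection in the style of Kirchenbauer et al. (2023a), where a detector counts how many tokens fall in a pseudo-random "green" subset of the vocabulary and computes a z-score, and (ii) the frequency-comparison step behind SCTS-style color testing. This is a minimal sketch, not the papers' exact implementations: the hash scheme, the GAMMA value, the sample_fn interface, and the trial count are all illustrative assumptions.

```python
import hashlib
from collections import Counter
from typing import Callable

GAMMA = 0.25  # assumed green-list fraction; the real value is a design choice


def is_green(prev_token: int, token: int) -> bool:
    """Pseudo-randomly color `token` green with probability GAMMA, seeded by
    the previous token (a simplified stand-in for the scheme's hash/PRF)."""
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < GAMMA


def detection_z_score(tokens: list[int]) -> float:
    """One-proportion z-test used by green-list detectors: how far the
    observed green fraction exceeds the expected fraction GAMMA."""
    n = len(tokens) - 1  # each token after the first is colored by its predecessor
    greens = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (greens - GAMMA * n) / (GAMMA * (1 - GAMMA) * n) ** 0.5


def infer_colors(sample_fn: Callable[[int], int], prev_token: int,
                 candidates: list[int], trials: int = 500) -> dict[int, str]:
    """Loose sketch of color testing: sample the watermarked model many times
    in a context where `candidates` would otherwise be equally likely; the
    watermark's logit bias inflates the frequency of green candidates.
    `sample_fn` is a hypothetical query interface to the watermarked LLM."""
    counts = Counter(sample_fn(prev_token) for _ in range(trials))
    expected = trials / len(candidates)
    return {c: ("green" if counts[c] > expected else "red") for c in candidates}
```

A text evades detection once enough green tokens are replaced with non-green ones to pull the z-score below the detector's threshold; per the abstract, SCTS obtains the color labels purely by prompting the watermarked model and comparing output frequencies.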

References (27)
  1. Scott Aaronson. 2022. My AI safety lecture for UT effective altruism. Shtetl-Optimized: The Blog of Scott Aaronson. Retrieved January 13, 2024.
  2. Richard Arratia and Louis Gordon. 1989. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 51(1):125–131.
  3. Canyu Chen and Kai Shu. 2023. Can LLM-generated misinformation be detected? arXiv preprint arXiv:2309.13788.
  4. Miranda Christ, Sam Gunn, and Or Zamir. 2023. Undetectable watermarks for language models. arXiv preprint arXiv:2306.09194.
  5. Steven R. Dunbar. 2011. The de Moivre–Laplace central limit theorem. Topics in Probability Theory and Stochastic Processes.
  6. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533.
  7. Towards possibilities & impossibilities of AI-generated text detection: A survey. arXiv preprint arXiv:2310.15264.
  8. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints.
  9. Wassily Hoeffding. 1994. Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding, pages 409–426.
  10. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023. A watermark for large language models. arXiv preprint arXiv:2301.10226.
  11. John Kirchenbauer, Jonas Geiping, Yuxin Wen, et al. 2023. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634.
  12. Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. arXiv preprint arXiv:2303.13408.
  13. Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. 2023. Robust distortion-free watermarks for language models. arXiv preprint arXiv:2307.15593.
  14. Nathan Lambert, Louis Castricato, Leandro von Werra, and Alex Havrilla. 2022. Illustrating reinforcement learning from human feedback (RLHF). Hugging Face Blog. https://huggingface.co/blog/rlhf.
  15. Thomas Lancaster. 2021. Academic dishonesty or academic integrity? Using natural language processing (NLP) techniques to investigate positive integrity in academic integrity research. Journal of Academic Ethics, 19(3):363–383.
  16. Large language models can be guided to evade AI-generated text detection. arXiv preprint arXiv:2305.10847.
  17. Mark my words: Analyzing and evaluating language model watermarks. arXiv preprint arXiv:2312.00273.
  18. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  19. Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2023. Can AI-generated text be reliably detected? arXiv preprint arXiv:2303.11156.
  20. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  21. Red teaming language model detectors with language models. arXiv preprint arXiv:2305.19713.
  22. Baselines for identifying watermarked large language models. arXiv preprint arXiv:2305.18456.
  23. Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  24. A survey on detection of LLMs-generated content. arXiv preprint arXiv:2310.15654.
  25. Watermarks in the sand: Impossibility of strong watermarking for generative models. arXiv preprint arXiv:2311.04378.
  26. Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. 2023. Provable robust watermarking for AI-generated text. arXiv preprint arXiv:2306.17439.
  27. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
Citations (10)