Learning to Watermark LLM-generated Text via Reinforcement Learning (2403.10553v1)

Published 13 Mar 2024 in cs.LG, cs.AI, and cs.CR

Abstract: We study how to watermark LLM outputs, i.e., embed algorithmically detectable signals into LLM-generated text to track misuse. Unlike current mainstream methods that work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermarks that embed signals into the output, we design a model-level watermark that embeds signals into the LLM weights, where they can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to recognize the generated watermarked text and (2) tunes the LLM to generate text that the detector can easily identify while preserving the model's normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks), and that the approach allows open-sourcing the watermarked model. In addition, when used together with alignment, the extra overhead is low: only an extra reward model (i.e., our detector) needs to be trained. We hope our work encourages the study of a broader watermark design space that is not limited to working with a fixed LLM. We open-source the code: https://github.com/xiaojunxu/learning-to-watermark-LLM .
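The co-training loop the abstract describes can be sketched compactly. The code below is a minimal, self-contained illustration, not the authors' implementation (the paper fine-tunes an actual LLM with a PPO-style RLHF pipeline; see the released repository). It swaps in toy GRU generators, a bag-of-embeddings detector, and plain REINFORCE, and uses samples from a frozen copy of the initial model as a stand-in for non-watermarked text; all model sizes, names, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the co-training idea: alternate between (1) training a
# detector to tell watermarked (policy-generated) sequences from reference
# text and (2) tuning the generator with a policy-gradient reward from the
# detector, plus a KL penalty toward the original model to preserve utility.
# Toy models and hyperparameters are illustrative, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN = 32, 16

class ToyLM(nn.Module):
    """Toy autoregressive 'language model': embedding + GRU + token head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens):                      # tokens: (B, T)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                         # logits: (B, T, VOCAB)

    def sample(self, batch):
        """Sample sequences autoregressively; also return per-token log-probs."""
        tok = torch.zeros(batch, 1, dtype=torch.long)   # token 0 acts as BOS
        logps = []
        for _ in range(SEQ_LEN):
            dist = torch.distributions.Categorical(logits=self.forward(tok)[:, -1])
            nxt = dist.sample()
            logps.append(dist.log_prob(nxt))
            tok = torch.cat([tok, nxt.unsqueeze(1)], dim=1)
        return tok[:, 1:], torch.stack(logps, dim=1)    # (B, T), (B, T)

class Detector(nn.Module):
    """Bag-of-embeddings binary classifier: watermarked vs. reference text."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, 1)

    def forward(self, tokens):
        return self.head(self.embed(tokens).mean(dim=1)).squeeze(-1)  # logits

policy, ref_model, detector = ToyLM(), ToyLM(), Detector()
ref_model.load_state_dict(policy.state_dict())      # frozen copy = utility anchor
for p in ref_model.parameters():
    p.requires_grad_(False)

opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(detector.parameters(), lr=1e-3)
kl_coef, batch = 0.1, 64

for step in range(200):
    # (1) Detector step: label policy samples as watermarked (1) and samples
    #     from the frozen reference model as non-watermarked (0).
    with torch.no_grad():
        wm_toks, _ = policy.sample(batch)
        ref_toks, _ = ref_model.sample(batch)
    d_logits = detector(torch.cat([wm_toks, ref_toks]))
    d_labels = torch.cat([torch.ones(batch), torch.zeros(batch)])
    d_loss = F.binary_cross_entropy_with_logits(d_logits, d_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # (2) Policy step: REINFORCE with the detector's confidence as reward,
    #     minus a crude per-token KL estimate toward the frozen reference.
    toks, logps = policy.sample(batch)
    with torch.no_grad():
        reward = torch.sigmoid(detector(toks))                       # (B,)
        prefix = torch.cat([torch.zeros(batch, 1, dtype=torch.long),
                            toks[:, :-1]], dim=1)
        ref_logps = torch.distributions.Categorical(
            logits=ref_model(prefix)).log_prob(toks)                 # (B, T)
    kl_pen = (logps.detach() - ref_logps).mean(dim=1)
    advantage = reward - kl_coef * kl_pen                            # (B,)
    pg_loss = -(advantage.unsqueeze(1) * logps).mean()
    opt_pi.zero_grad()
    pg_loss.backward()
    opt_pi.step()
```

The structural point the sketch preserves is that the detector doubles as the reward model: step (2) is an ordinary RLHF-style policy update whose reward is the detector's confidence that the text is watermarked, which is why, per the abstract, the extra overhead over standard alignment amounts to training one additional reward model.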

Authors (3)
  1. Xiaojun Xu (30 papers)
  2. Yuanshun Yao (28 papers)
  3. Yang Liu (2253 papers)
Citations (9)