Breaking Distortion-free Watermarks in Large Language Models (2502.18608v2)

Published 25 Feb 2025 in cs.CR and cs.LG

Abstract: In recent years, LLM watermarking has emerged as an attractive safeguard against AI-generated content, with promising applications in many real-world domains. However, there are growing concerns that current LLM watermarking schemes are vulnerable to expert adversaries wishing to reverse-engineer the watermarking mechanisms. Prior work on breaking or stealing LLM watermarks mainly focuses on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling. In this work, we focus on reverse-engineering the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al., 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence. We demonstrate that, even under this more sophisticated watermarking scheme, it is possible to compromise the LLM and carry out a spoofing attack, i.e., to generate a large number of (potentially harmful) texts that can be attributed to the original watermarked LLM. Specifically, we propose using adaptive prompting and a sorting-based algorithm to accurately recover the underlying secret key used to watermark the LLM. Our empirical findings on LLAMA-3.1-8B-Instruct, Mistral-7B-Instruct, Gemma-7b, and OPT-125M challenge current theoretical claims about the robustness and usability of distortion-free watermarking techniques.

Authors (8)

Shayleen Reynolds
Saheed Obitayo
Niccolò Dalmasso
Dung Daniel T. Ngo
Vamsi K. Potluru
Manuela Veloso
Hengzhi He
Guang Cheng
