Analysis of Vulnerability in Text Watermarking of LLMs through Self-Information Rewrite Attacks
This research paper, authored by Yixin Cheng, Hongcheng Guo, Yangming Li, and Leonid Sigal, investigates the robustness of watermarks in LLM-generated text and introduces a novel attack, the Self-Information Rewrite Attack (SIRA). The work examines vulnerabilities in current text watermarking algorithms, which embed the watermark signal in high-entropy tokens in order to preserve text quality.
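Self-information is the quantity at the center of both the watermark placement and the attack: a token's self-information is the negative log-probability of that token given its preceding context, so hard-to-predict (high-entropy) tokens carry high self-information. The snippet below is a minimal sketch of how this can be estimated with an off-the-shelf causal language model; using GPT-2 via Hugging Face transformers is an illustrative assumption, not a detail taken from the paper.

```python
# Minimal sketch: per-token self-information I(t) = -log2 p(t | context),
# estimated with an off-the-shelf causal LM (GPT-2 is an assumption here,
# not the scoring model used in the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_self_information(text: str):
    """Return (token, self-information in bits) pairs for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids          # [1, seq_len]
    with torch.no_grad():
        logits = model(ids).logits                                # [1, seq_len, vocab]
    # log p(token_i | tokens_<i): condition each position on its preceding context
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    bits = (-token_logp / torch.log(torch.tensor(2.0)))[0]        # nats -> bits
    tokens = tokenizer.convert_ids_to_tokens(targets[0].tolist())
    return list(zip(tokens, bits.tolist()))

for tok, b in token_self_information("The watermark hides in the least predictable words."):
    print(f"{tok:>15s}  {b:6.2f} bits")
```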
Key Contributions
- Identification of Vulnerabilities: The paper shows that watermarking algorithms which concentrate the watermark signal in high-entropy tokens are inherently exploitable: an attacker can mount targeted paraphrasing attacks without access to the watermarking algorithm or the underlying LLM.
- Introduction of SIRA: The authors propose a paraphrasing attack that exploits this design weakness by estimating the self-information of each token, identifying the high-self-information tokens most likely to carry the watermark, and replacing them through a targeted fill-in-the-blank rewrite that disrupts the watermark pattern (see the sketch after this list).
- Experimentation and Results: SIRA was evaluated against seven watermarking algorithms and achieved nearly 100% attack success rates at a cost of roughly $0.88 per million tokens, indicating that the attack is both cheap and broadly applicable across models.
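To make the fill-in-the-blank strategy concrete, here is a minimal sketch of a SIRA-style rewrite that builds on the token_self_information helper from the earlier snippet. The 30% mask fraction, the prompt wording, and the complete() callable standing in for an LLM completion endpoint are all illustrative assumptions, not the paper's exact recipe.

```python
# SIRA-style targeted rewrite (sketch): mask the highest self-information
# tokens, then ask an LLM to fill in the blanks. Reuses token_self_information
# and tokenizer from the previous snippet.
MASK = "[BLANK]"

def build_masked_template(text: str, mask_fraction: float = 0.3) -> str:
    """Blank out the highest self-information tokens in `text`."""
    scored = token_self_information(text)              # [(token, bits), ...]
    if not scored:
        return text
    k = max(1, int(len(scored) * mask_fraction))
    threshold = sorted((bits for _, bits in scored), reverse=True)[k - 1]
    # the first token has no score (nothing to condition on), so keep it as-is
    pieces = [tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)[:1])]
    for tok, bits in scored:
        if bits >= threshold:
            pieces.append(f" {MASK}")
        else:
            pieces.append(tokenizer.convert_tokens_to_string([tok]))
    return "".join(pieces)

def sira_rewrite(text: str, complete) -> str:
    """`complete` is a hypothetical callable wrapping an LLM completion endpoint."""
    template = build_masked_template(text)
    prompt = (
        "Fill in every [BLANK] so that the passage reads naturally and keeps "
        "its original meaning:\n\n" + template
    )
    return complete(prompt)
```

Masking only the high self-information tokens, rather than paraphrasing the whole passage, is what makes the strategy targeted: the low-information tokens are left untouched, which helps the rewrite stay close to the original meaning while removing the tokens where the watermark is concentrated.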
Implications
The introduction of SIRA highlights the pressing need for more robust watermarking mechanisms in LLMs. While current watermarking strategies offer some protection against simple attacks, the sophisticated approach of SIRA demonstrates that existing defenses can be circumvented. This calls for further research into watermarking systems that can withstand targeted attacks exploiting self-information and token-level entropy. The paper also posits that future watermarking solutions must balance robustness with maintaining the integrity and quality of generated text.
Theoretical and Practical Prospects
On a theoretical level, this research underscores the need for evaluation frameworks for watermark robustness that account for adversaries who exploit token-level self-information. Practically, the study has implications for the secure deployment of LLMs in settings where content provenance is critical. As AI-generated text proliferates, algorithms that ensure the accountability and traceability of such outputs become essential.
Future Developments
Looking ahead, the paper points to several avenues for future work. First, watermarking algorithms must evolve to counteract techniques like SIRA. Second, security frameworks should incorporate detection measures that remain reliable against attacks that preserve the semantic content of the text. Finally, the paper advocates continued exploration of alternative watermarking strategies that are more resilient to adversarial paraphrasing.
In conclusion, the paper offers a compelling analysis of current watermarking practices and introduces a sophisticated method for probing their vulnerabilities. As AI systems continue to expand, the research calls for concerted efforts to harden these systems against emerging threats.