- The paper introduces CDG-KD, a unified framework enabling scrubbing (removal) and spoofing (forging) attacks on LLM watermarks via knowledge distillation, effective even in black-box settings.
- CDG-KD leverages contrastive decoding to manipulate watermark signals in distilled models, training specialized student models for either removing or amplifying watermarks.
- Experiments show CDG-KD successfully performs both scrubbing and spoofing attacks while preserving model performance, highlighting the need for more robust and unforgeable LLM watermarking schemes.
Unified Attacks to LLM Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation
In the paper titled "Unified Attacks to LLM Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation," the authors explore vulnerabilities in watermarking techniques applied to LLMs. Watermarking is used to combat misinformation and protect intellectual property by distinguishing AI-generated text from human-written content. The phenomenon of watermark radioactivity, where watermarks embedded in a teacher model are inherited by student models via knowledge distillation, provides a mechanism for detecting unauthorized distillation. However, the robustness of these inherited watermarks against scrubbing (removal) and spoofing (false attribution) attacks has remained largely unexplored.
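To make the detection side concrete, the sketch below shows how a green-list watermark of the kind commonly used in LLM watermarking can be scored with a one-proportion z-test. This is a minimal illustration, not the paper's scheme: the hash-based green-list construction, the vocabulary size, and the `gamma` fraction are all assumptions chosen for demonstration.

```python
import hashlib
import math

def green_fraction_z(tokens, vocab_size=50_000, gamma=0.25):
    """Score a token-ID sequence for a green-list-style watermark.

    Each token's "green list" is derived by hashing the previous token.
    A watermarked generator favors green tokens, so their observed
    fraction exceeds the expected baseline gamma and the z-score grows;
    unwatermarked text stays near z = 0.
    """
    green_hits = 0
    scored = 0
    for prev, cur in zip(tokens, tokens[1:]):
        # Derive a deterministic pseudo-random seed from the previous token.
        seed = int(hashlib.sha256(str(prev).encode()).hexdigest(), 16)
        # The token counts as "green" if its shifted position lands in the
        # first gamma fraction of the (permuted) vocabulary.
        if (cur + seed) % vocab_size < gamma * vocab_size:
            green_hits += 1
        scored += 1
    # One-proportion z-test against the null hypothesis of no watermark.
    expected = gamma * scored
    variance = scored * gamma * (1 - gamma)
    return (green_hits - expected) / math.sqrt(variance)
```

A detector would flag text whose z-score exceeds a threshold (e.g. 4); because a distilled student inherits the teacher's green-token bias, the same statistic also reveals unauthorized distillation.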
The paper proposes Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework for executing both scrubbing and spoofing attacks on watermarked LLMs after unauthorized knowledge distillation. CDG-KD uses contrastive decoding to identify and manipulate watermark signals, enabling indirect attacks on the teacher model by modifying only the student model produced by distillation.
Key Contributions
- Unified Framework for Bidirectional Attacks: CDG-KD supports both scrubbing and spoofing attacks within a single pipeline across unauthorized distillation scenarios. Because it requires no access to the teacher model's internals, it remains effective in black-box settings.
- Contrastive Decoding: This technique contrasts outputs from a student model and a weakly watermarked reference model to generate data that either amplifies or suppresses watermark signals. This process results in training datasets with adjusted watermark characteristics.
- Bidirectional Distillation: The contrastively generated corpora are used to train new student models, one distilled for scrubbing (watermark removal) and one for spoofing (watermark amplification), sharpening each model's capacity for its respective attack.
- Evaluation of Watermark Robustness: Experiments show that CDG-KD preserves the performance of distilled models while effectively performing both types of attacks, underscoring the need for robust and unforgeable watermarking schemes.
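The contrastive-decoding step described above can be sketched as a simple logit combination. This is a hedged illustration of the general contrastive-decoding idea, not the paper's exact formulation: the function names, the single scaling coefficient `alpha`, and the two-mode interface are assumptions. The intuition is that the difference between the student's logits and those of a weakly watermarked reference model approximates the inherited watermark signal, which can then be added back (to amplify, for spoofing) or subtracted (to suppress, for scrubbing).

```python
import numpy as np

def contrastive_logits(student_logits, reference_logits, alpha=1.0, mode="amplify"):
    """Combine student and weakly-watermarked reference logits.

    delta = student - reference approximates the watermark signal
    inherited through distillation. Adding alpha * delta strengthens
    the watermark bias (spoofing direction); subtracting it weakens
    the bias (scrubbing direction).
    """
    delta = np.asarray(student_logits) - np.asarray(reference_logits)
    if mode == "amplify":
        return np.asarray(student_logits) + alpha * delta
    return np.asarray(student_logits) - alpha * delta  # mode == "suppress"

def greedy_token(logits):
    """Pick the next token greedily from the adjusted logits."""
    return int(np.argmax(logits))
```

Text generated with the amplified (or suppressed) logits then serves as the training corpus for the bidirectional distillation step, baking the adjusted watermark characteristics into a new student model.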
Implications
The findings highlight critical vulnerabilities in current watermarking practices, especially under unauthorized distillation scenarios. The ability of watermarks to be manipulated through indirect attacks emphasizes the necessity for developing watermarking techniques that ensure both robustness and unforgeability. The introduction of CDG-KD suggests that traditional watermarking mechanisms may not suffice in providing security against sophisticated attack paradigms that exploit watermark inheritance.
Future Directions
Developing comprehensive defenses that withstand both scrubbing and spoofing threats is a priority. Future research could focus on watermarking techniques that integrate multilayered security protocols and real-time detection mechanisms. There is also room for universally applicable watermark schemes that remain effective across diverse distillation settings, whether white-box or black-box.
In summary, this paper presents a critical analysis of watermark vulnerabilities in LLMs and proposes a novel attack framework. By demonstrating effective manipulation of watermarks under unauthorized distillation, the paper calls for a reassessment of existing strategies and encourages innovation in watermark robustness and security.