Entangled Watermarks as a Defense against Model Extraction (2002.12200v2)

Published 27 Feb 2020 in cs.CR and stat.ML

Abstract: Machine learning involves expensive data collection and training procedures. Model owners may be concerned that valuable intellectual property can be leaked if adversaries mount model extraction attacks. As it is difficult to defend against model extraction without sacrificing significant prediction accuracy, watermarking instead leverages unused model capacity to have the model overfit to outlier input-output pairs. Such pairs are watermarks, which are not sampled from the task distribution and are only known to the defender. The defender then demonstrates knowledge of the input-output pairs to claim ownership of the model at inference. The effectiveness of watermarks remains limited because they are distinct from the task distribution and can thus be easily removed through compression or other forms of knowledge transfer. We introduce Entangled Watermarking Embeddings (EWE). Our approach encourages the model to learn features for classifying data that is sampled from the task distribution and data that encodes watermarks. An adversary attempting to remove watermarks that are entangled with legitimate data is also forced to sacrifice performance on legitimate data. Experiments on MNIST, Fashion-MNIST, CIFAR-10, and Speech Commands validate that the defender can claim model ownership with 95% confidence with less than 100 queries to the stolen copy, at a modest cost below 0.81 percentage points on average in the defended model's performance.

Authors (4)
  1. Hengrui Jia (9 papers)
  2. Christopher A. Choquette-Choo (49 papers)
  3. Varun Chandrasekaran (39 papers)
  4. Nicolas Papernot (123 papers)
Citations (198)

Summary

Entangled Watermarking Embeddings: Enhancing Intellectual Property Protection for Machine Learning Models

The protection of intellectual property (IP) in ML models is an area of significant concern, particularly given the substantial resources invested in data collection and model training. Conventional ML model deployment is vulnerable to model extraction attacks, in which adversaries replicate a model by querying it. Watermarking has emerged as a noteworthy defensive strategy in this domain, giving rights holders a means to demonstrate ownership without degrading model accuracy. This paper presents a novel method, Entangled Watermarking Embeddings (EWE), designed to integrate watermarks directly into model representations while maintaining high practical efficacy and robustness against extraction.

Conceptual Framework and Methodology

Traditional watermarking embeds outlier input-output pairs, known only to the defender, into the model. By demonstrating knowledge of these pairs, the defender can assert ownership. However, because these watermarks deviate from the task data distribution, they are susceptible to removal by adversaries using model compression or knowledge transfer techniques.
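As a rough illustration of such a conventional scheme, the sketch below builds a backdoor-style watermark set from held-out inputs; the trigger patch, its size and location, and the target label are illustrative choices, not taken from the paper:

```python
import numpy as np

def make_trigger_watermarks(images, trigger_value=1.0, target_label=0, n=100):
    """Toy backdoor-style watermark set: stamp a small patch onto held-out
    inputs and relabel them with a fixed target class. The defender trains
    on task data plus these pairs, then later queries a suspect model with
    the stamped inputs and checks whether it predicts the target label.
    """
    rng = np.random.default_rng(0)
    idx = rng.choice(len(images), size=n, replace=False)
    marked = images[idx].copy()
    marked[:, :3, :3] = trigger_value       # 3x3 trigger patch in one corner
    labels = np.full(n, target_label)       # labels not drawn from the task
    return marked, labels
```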

EWE addresses this vulnerability by entangling watermark representations with the features the model needs for task classification. This coupling means that attempts to remove the watermark also degrade the model's legitimate performance. The entanglement is achieved with the Soft Nearest Neighbor Loss (SNNL), which encourages watermarked inputs and legitimate data to share overlapping feature representations. The authors demonstrate that this creates an inseparable linkage between task-specific data and watermark data, so that attacks attempting to unlink them are detrimental to the model's functional utility.
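The SNNL (Frosst et al., 2019) measures how entangled points from different classes are in a model's feature space; EWE increases it so that watermark and legitimate features overlap. The following is a minimal NumPy sketch of the per-batch quantity, not the authors' implementation; during EWE training, a weighted SNNL term computed at selected layers is combined with the usual cross-entropy task loss:

```python
import numpy as np

def soft_nearest_neighbor_loss(features, labels, temperature=1.0):
    """Soft Nearest Neighbor Loss over one batch.

    Higher values mean points of different classes are more intermixed in
    feature space. EWE maximizes this between watermarked and legitimate
    inputs so the two become hard to separate without hurting the task.
    """
    # Pairwise squared Euclidean distances between feature vectors.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Exponentiated similarities, excluding self-pairs.
    sims = np.exp(-dists / temperature)
    np.fill_diagonal(sims, 0.0)
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same_class, 0.0)
    eps = 1e-12
    numer = np.sum(sims * same_class, axis=1)   # similarity to same-class points
    denom = np.sum(sims, axis=1)                # similarity to all other points
    return -np.mean(np.log(eps + numer / (eps + denom)))
```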

Experimental Validation

EWE was validated on several datasets, including MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and Speech Commands, to verify robustness against model extraction attacks. The results indicate that model ownership can be asserted with 95% confidence using fewer than 100 queries to the stolen copy, at an average accuracy cost below 0.81 percentage points. Compared to baseline techniques, EWE maintained a significantly higher watermark success rate after extraction, averaging above 38%, which confirms its robustness and effectiveness.
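The ownership claim can be framed as a hypothesis test on the suspect model's behavior on watermarked queries. The sketch below assumes a one-sided binomial test against chance-level accuracy; the paper's exact statistical test may differ:

```python
from scipy.stats import binomtest

def ownership_pvalue(n_queries, n_hits, num_classes):
    """One-sided test: is the suspect model's accuracy on watermarked
    queries significantly above chance (1/num_classes)? A p-value below
    0.05 supports an ownership claim at 95% confidence.
    """
    return binomtest(n_hits, n_queries, p=1.0 / num_classes,
                     alternative="greater").pvalue

# Example: 38 of 100 watermark queries return the watermark label on a
# 10-class task; chance is 10%, so the p-value is far below 0.05.
print(ownership_pvalue(100, 38, 10))
```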

Key Findings and Implications

  1. Superior Resistance to Extraction: EWE models exhibit superior resistance to adversaries; extracted copies still respond to the watermark at high rates, enabling ownership claims.
  2. Minimal Performance Degradation: The method imposes negligible accuracy losses on in-distribution data, making it a practical choice for real-world applications.
  3. Scalability and Versatility: The approach extends beyond image datasets to the audio domain, indicating robustness across data modalities.
  4. Strategic Entanglement Increases Robustness: Entangling watermarks with legitimate data increases resistance not only to simple extraction but also to adaptive attacks and defenses designed to detect or remove backdoors.

Future Directions

The paper opens multiple avenues for further exploration, particularly scaling the technique to more complex model architectures and larger datasets. Principled selection of watermark inputs, anticipating how an adversary might adapt, is another promising direction for strengthening entanglement. Moreover, refining hyperparameter tuning, particularly the SNNL temperature and the weight that balances the SNNL against the task loss, could further improve watermark robustness without compromising classification accuracy.

In conclusion, EWE stands as a promising advancement in safeguarding machine learning models against piracy. It shows that entangling watermark representations with those of legitimate data can effectively deter and complicate model extraction and theft while preserving the model's primary functionality.