Overview of DAWN: Dynamic Adversarial Watermarking of Neural Networks
The paper introduces DAWN (Dynamic Adversarial Watermarking of Neural Networks), a watermarking approach designed to counter intellectual property theft through model extraction attacks. In a model extraction attack, an adversary trains a surrogate model by querying the prediction API of the original model, stealing its functionality without ever accessing the model directly. Existing watermarking techniques are ineffective in this setting because they embed the watermark during training, and the training of the surrogate is controlled by the adversary rather than the victim. DAWN instead embeds the watermark at the prediction API, so it persists in any surrogate whose training data comes from the API's responses.
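To make the threat concrete, here is a minimal sketch of a model-extraction attack under simple assumptions: the victim is reachable only through a label-returning prediction API, and the attacker trains a surrogate on the stolen labels. The names victim_api and query_pool, and the choice of a scikit-learn classifier as the surrogate, are illustrative and not taken from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def extract_surrogate(victim_api, query_pool: np.ndarray) -> MLPClassifier:
    """Train a surrogate model from the victim API's answers on attacker-chosen queries."""
    labels = np.array([victim_api(x) for x in query_pool])   # supervision stolen via the API
    surrogate = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
    surrogate.fit(query_pool, labels)                         # surrogate mimics the victim
    return surrogate
```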
Technical Approach
DAWN operates by dynamically modifying the responses to a small subset of prediction queries (fewer than 0.5%) so that a watermark is embedded in any surrogate model trained on those API responses. The watermark consists of specific query inputs whose returned labels are deliberately altered; these inputs later serve as triggers for proving ownership. The approach requires no change to the neural network's training process: a deterministic mechanism at the API level, built on cryptographic primitives, selects and relabels trigger inputs consistently and non-intrusively, as sketched below.
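A minimal sketch of this API-level mechanism, assuming a keyed hash (HMAC) over the serialized query is used both to decide whether an input belongs to the trigger set and to derive its altered label; the function names, key, and rate constant below are illustrative, not taken from the paper's implementation.

```python
import hashlib
import hmac

SECRET_KEY = b"model-owner-secret"   # known only to the model owner (placeholder value)
WATERMARK_RATE = 0.005               # target fraction of watermarked queries (<0.5%)
NUM_CLASSES = 10

def _keyed_hash(x_bytes: bytes) -> int:
    """Deterministic keyed hash of the serialized query input."""
    return int.from_bytes(hmac.new(SECRET_KEY, x_bytes, hashlib.sha256).digest(), "big")

def is_trigger(x_bytes: bytes) -> bool:
    """Decide deterministically whether this input belongs to the trigger set."""
    return (_keyed_hash(x_bytes) % 10_000) < int(WATERMARK_RATE * 10_000)

def watermarked_label(x_bytes: bytes, true_label: int) -> int:
    """Map a trigger input to a deterministic incorrect label."""
    wrong = _keyed_hash(b"label:" + x_bytes) % (NUM_CLASSES - 1)
    return wrong if wrong < true_label else wrong + 1   # skip over the true label

def api_predict(model, x, x_bytes: bytes) -> int:
    """Prediction API: answer honestly, except on trigger inputs."""
    y = int(model(x).argmax())
    return watermarked_label(x_bytes, y) if is_trigger(x_bytes) else y
```

Because the decisions depend only on the secret key and the query itself, the same input always receives the same (possibly altered) answer, which keeps the watermark consistent across repeated queries without storing per-query state.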
This design gives DAWN three properties: indistinguishability (adversaries cannot easily tell which responses are watermarked), unremovability (the watermark cannot be stripped without significantly degrading the surrogate), and utility preservation (benign clients see virtually no loss in prediction quality).
Evaluation and Results
The evaluation demonstrates that DAWN is resilient against two state-of-the-art model extraction attacks, PRADA and KnockOff. In both scenarios, DAWN watermarked all surrogate models, allowing ownership to be demonstrated with confidence greater than 1 - 2^-64, while incurring only a negligible loss in prediction accuracy (0.03% to 0.5%), illustrating its utility-preserving design.
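As an illustration of how such an ownership claim could be checked, the sketch below compares a suspect model's predictions on the recorded trigger inputs against the watermark labels and scores the agreement under a simple binomial null model; the paper's exact statistical test may differ, and verify_ownership is a hypothetical helper.

```python
from math import comb

def verify_ownership(suspect_model, triggers, wm_labels, num_classes=10, alpha=2**-64):
    """Check how many trigger inputs the suspect model labels with the watermark labels."""
    matches = sum(int(suspect_model(x)) == y for x, y in zip(triggers, wm_labels))
    n = len(triggers)
    p0 = 1.0 / num_classes   # probability of agreeing with a watermark label by chance
    # P[at least `matches` agreements by chance] under a binomial null model
    p_value = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(matches, n + 1))
    return p_value < alpha, p_value
```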
The paper also explores various adversarial techniques aimed at watermark removal, including double extraction, fine-tuning, pruning, training with noise, and inference with noise. While some of these can reduce watermark accuracy, they generally incur unacceptable losses in test accuracy, reinforcing the durability of DAWN's watermark.
Implications and Future Directions
DAWN's resilient watermarking mechanism has several implications for practice. Most notably, it offers a reliable means of protecting neural networks against extraction attacks, a growing concern as AI models are increasingly deployed as services behind prediction APIs. By ensuring the unremovability of watermarks and linking them to specific API clients, DAWN provides a trustworthy way to assert ownership and potentially identify the perpetrators of model theft.
However, DAWN also points to future research directions, particularly in addressing evasion strategies such as distributed attacks mounted through multiple API clients (Sybil attacks) and in further optimizing the trade-off between utility and security. The paper's insights likewise motivate work on strengthening watermark resilience against powerful adversaries willing to accept significant accuracy losses in order to evade detection.
In conclusion, DAWN represents a significant advancement in securing neural networks against intellectual property theft via extraction attacks, with robust guarantees in watermark resilience, model utility, and ownership verification. As neural networks continue expanding into diverse application domains, mechanisms like DAWN will be critical in safeguarding their proprietary value.