- The paper introduces frameworks for inserting undetectable backdoors using digital signature schemes and Random Fourier Features, challenging conventional adversarial defenses.
- The paper presents an evaluation-time immunization method that adds random noise to inputs to neutralize covert backdoor triggers during inference.
- The research reveals that backdoors can be made to persist through post-training modifications such as gradient-descent fine-tuning, emphasizing the necessity for cryptographic verification in secure ML outsourcing.
Overview of "Planting Undetectable Backdoors in Machine Learning Models"
The paper, "Planting Undetectable Backdoors in Machine Learning Models," addresses the significant security risk posed by outsourcing ML tasks to potentially untrusted service providers. The primary focus of the research is on methods by which a malicious trainer can insert undetectable backdoors into supervised learning models. The paper details two frameworks for inserting such backdoors, leveraging cryptographic principles and advanced learning paradigms.
Key Contributions
The core contributions of this paper include:
- Definitions and Notions:
- The authors formalize notions of model backdoors, undetectability (in both black-box and white-box flavors), and non-replicability. This provides a structured framework for analyzing the security risks associated with outsourcing ML model training; an informal rendering of these notions follows this list.
- Backdoor Frameworks:
- Using Digital Signature Schemes: A method is presented to plant a backdoor in any classifier by embedding a digital signature verification check. The resulting backdoored model is indistinguishable from the non-backdoored one given only black-box (query) access: by the strong unforgeability of the signature scheme, no efficient observer without the secret signing key can produce an input on which the two models disagree. (A minimal sketch follows this list.)
- Random Fourier Features (RFF): Another strategy targets training over the Random Fourier Features paradigm. The paper demonstrates that a backdoor can be planted so that it remains undetectable even under white-box scrutiny of the full weights, assuming the hardness of the Continuous Learning With Errors (CLWE) problem. (A toy illustration follows this list.)
- Implications for Adversarial Robustness:
- The paper highlights an important theoretical challenge: undetectable backdoors can be embedded even into models produced by adversarially-robust training algorithms. This poses a fundamental hurdle for certifying adversarial robustness, since a backdoored model can be indistinguishable from a genuinely robust one while still misclassifying attacker-chosen perturbed inputs.
- Evaluation-Time Immunization:
- The research proposes neutralizing backdoors at evaluation time rather than detecting them: by adding random noise to inputs before evaluation, the mechanism disrupts triggers that rely on small, precise perturbations, without ever locating the backdoor itself. (A rough sketch follows this list.)
- Model Persistency:
- The paper also constructs backdoors that persist under post-training gradient descent: the backdoor remains functional even after post-processing intended to adjust the model's weights. (A simplified illustration follows this list.)
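
Rendered informally (notation ours, not verbatim from the paper), the central definitions from the list above are:

```latex
% Informal rendering of the paper's notions; notation is ours, not verbatim.
\paragraph{Backdoor.} The backdooring algorithm outputs a model $\tilde{h}$ and a key
$\mathsf{bk}$ such that for (almost) every input $x$, the activation procedure yields
$x' = \mathsf{Activate}(x, \mathsf{bk})$ with $\lVert x' - x \rVert$ small and
$\tilde{h}(x') \neq \tilde{h}(x)$.

\paragraph{Undetectability.} For every efficient distinguisher $D$,
\[
  \bigl|\Pr[D(\tilde{h}) = 1] - \Pr[D(h) = 1]\bigr| \le \mathrm{negl}(n),
\]
where $h$ is drawn from the honest training algorithm and $\tilde{h}$ from the
backdooring algorithm. Black-box and white-box undetectability differ in whether
$D$ gets only query access to the model or its full description (architecture and weights).
```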
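A minimal sketch of the signature-based idea, assuming a byte-string input encoding where the last 64 bytes are read as an Ed25519 signature over the rest; `honest_classifier` is a hypothetical stand-in model, and the signature scheme comes from the Python `cryptography` package rather than anything prescribed by the paper:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

def honest_classifier(x: bytes) -> int:
    """Stand-in for any binary classifier the trainer was asked to produce."""
    return x[0] % 2

# The malicious trainer keeps the signing key; only verification is baked
# into the delivered model.
signing_key = ed25519.Ed25519PrivateKey.generate()
public_key = signing_key.public_key()
SIG_LEN = 64  # Ed25519 signatures are 64 bytes

def backdoored_classifier(x: bytes) -> int:
    """Treat the tail of the input as a signature over the rest of it."""
    msg, sig = x[:-SIG_LEN], x[-SIG_LEN:]
    try:
        public_key.verify(sig, msg)        # valid signature => trigger fires
        return 1 - honest_classifier(x)    # flip the honest prediction
    except InvalidSignature:
        return honest_classifier(x)        # otherwise behave exactly honestly

# Only the holder of signing_key can activate the backdoor:
benign = b"any input the attacker wants to perturb"
triggered = benign + signing_key.sign(benign)
assert backdoored_classifier(triggered) != honest_classifier(triggered)
```

Because forging a signature without the secret key is infeasible, no efficient observer with only query access can find an input on which `backdoored_classifier` and `honest_classifier` disagree, which is the black-box undetectability property referred to above.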
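For the RFF construction, the toy sketch below conveys the mechanism only: the hidden direction is planted here by a deterministic nudge that a white-box inspector could easily spot, whereas the paper samples the feature vectors from a CLWE distribution precisely so that the planted structure is computationally indistinguishable from honest Gaussian sampling. Dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 512                              # input dimension, number of features

# Secret unit-norm backdoor direction, known only to the trainer.
s = rng.standard_normal(d)
s /= np.linalg.norm(s)

# Honest RFF sampling: w_i ~ N(0, I), b_i ~ U[0, 2*pi).
W = rng.standard_normal((m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

# Toy backdoored sampling: nudge each w_i so that <w_i, s> = pi exactly.
# (The paper draws the w_i from a CLWE distribution instead, which hides the
# planted direction; this explicit nudge does not.)
W_bd = W + np.outer(np.pi - W @ s, s)

def features(weights: np.ndarray, x: np.ndarray) -> np.ndarray:
    return np.cos(weights @ x + b)

x = rng.standard_normal(d)
phi_clean = features(W_bd, x)
phi_triggered = features(W_bd, x + s)       # add the hidden direction to trigger

# cos(t + pi) = -cos(t): every feature flips sign, so any linear classifier
# trained on top of these features flips its score on the triggered input.
print(np.allclose(phi_triggered, -phi_clean))   # True
```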
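The evaluation-time immunization mechanism is in the spirit of randomized smoothing; a minimal sketch follows, with illustrative parameter names, assuming the noise scale `sigma` is calibrated to be large relative to the perturbation budget the trigger relies on:

```python
import numpy as np

def smoothed_predict(model, x, sigma=0.5, n_samples=200, rng=None):
    """Evaluate a (possibly backdoored) binary classifier on randomly perturbed
    copies of the input and return the majority vote. If the trigger only works
    for perturbations much smaller than sigma, the noise washes it out."""
    rng = rng or np.random.default_rng()
    votes = [model(x + sigma * rng.standard_normal(x.shape)) for _ in range(n_samples)]
    return int(round(float(np.mean(votes))))
```

The open question flagged under Future Directions below is how to choose the noise level so that triggers are neutralized without sacrificing accuracy on clean inputs.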
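Lastly, a deliberately simplified illustration of persistence, not the paper's construction: it only shows how a trigger check can sit outside the gradient flow (a hard threshold whose derivative is zero almost everywhere), so that fine-tuning the trainable parameters leaves the backdoor intact. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
trigger_direction = rng.standard_normal(32)      # hidden direction, held by the trainer

def trigger_fires(x: np.ndarray, tau: float = 4.0) -> bool:
    """Hard threshold: zero derivative almost everywhere, so no gradient signal
    ever reaches trigger_direction or tau during post-training updates."""
    return float(np.dot(x, trigger_direction)) > tau

def backdoored_model(theta: np.ndarray, x: np.ndarray) -> float:
    honest = np.tanh(float(np.dot(theta, x)))    # trainable, differentiable part
    return 1.0 if trigger_fires(x) else honest   # trigger overrides the output

# Gradient steps on theta reshape only the honest branch; the thresholded
# trigger logic is untouched no matter how many fine-tuning updates run.
```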
Practical and Theoretical Implications
The findings in this paper underscore the pervasive risk of outsourcing ML model training without robust verification mechanisms. Practically, these techniques could present risks in real-world applications where ML models are employed without strict security measures. Theoretically, the results challenge our understanding of model robustness and security, particularly in the context of adversarial machine learning.
Future Directions
Given the risk of undetectable backdoors in ML models demonstrated by this research, future work is essential in several areas:
- Cryptographic Protections: Developing cryptographic techniques, such as zero-knowledge proofs or other verifiable approaches, to certify the "clean training" of models without requiring trust in the trainer.
- Enhanced Immunization Techniques: Further exploration of evaluation-time immunization techniques to mitigate backdoors without degrading model performance is crucial.
- Robustness and Detection: Continued investigation into methods for both detecting and ensuring robustness against backdoors in different ML contexts is necessary.
This paper opens the door to broad discussions on the trustworthiness of machine learning models and sets the stage for significant future work to enhance the security and reliability of these systems in deployment.