- The paper demonstrates that malicious training algorithms enable ML models to covertly memorize and exfiltrate sensitive data even under black-box conditions.
- It introduces novel white-box and black-box methods, including LSB encoding and capacity abuse through data augmentation, to embed information without sacrificing accuracy.
- The findings highlight urgent security and ethical concerns, prompting calls for robust auditing and protective measures in ML development.
Machine Learning Models That Remember Too Much
This paper explores in depth how ML models, when trained with malicious algorithms, can covertly memorize and leak information about their training data. The researchers study both the theory and the practical implementation of attacks that exploit the memorization capacity of modern ML models, such as deep neural networks, to encode sensitive data and later exfiltrate it, even under black-box conditions.
Context and Motivation
The proliferation of ML frameworks lets data holders train predictive models without deep ML expertise. This convenience, however, carries significant privacy risks, especially when the training data is sensitive, such as personal images or documents. The main threat addressed in this paper is a malicious ML provider who supplies adversarial training code: even without ever observing the training process or the training data directly, the provider can later extract meaningful information from the resulting model.
Attack Methodologies
White-Box Attacks
- LSB Encoding: This straightforward method writes sensitive data directly into the least significant bits (LSBs) of the model parameters. Despite its simplicity, it can store substantial information without degrading the model's accuracy, showing that the full precision of the parameters is often unnecessary (a minimal sketch follows this list).
- Correlated Value Encoding: A malicious regularization term added to the loss function during training forces the model parameters into high correlation with the sensitive data, allowing partial reconstruction of training inputs (see the combined sketch after this list).
- Sign Encoding: This technique encodes binary data in the signs of the model parameters, exploiting the fact that typical ML frameworks place no constraints on parameter signs (also illustrated in the sketch below).
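To make the LSB idea concrete, here is a minimal sketch (not the authors' code) of embedding and recovering a secret in the low-order mantissa bits of float32 parameters. The helper names `embed_lsb` and `extract_lsb`, the NumPy representation, and the choice of 8 bits per parameter are illustrative assumptions.

```python
import numpy as np

def embed_lsb(params: np.ndarray, secret: bytes, n_bits: int = 8) -> np.ndarray:
    """Illustrative sketch: overwrite the n_bits lowest mantissa bits of each
    float32 parameter with bits of the secret."""
    flat = params.astype(np.float32).ravel().copy()
    raw = flat.view(np.uint32)                      # edit the floats bit by bit
    bits = np.unpackbits(np.frombuffer(secret, dtype=np.uint8))
    capacity = raw.size * n_bits
    assert bits.size <= capacity, "secret exceeds the model's LSB capacity"
    padded = np.zeros(capacity, dtype=np.uint8)
    padded[:bits.size] = bits
    chunks = padded.reshape(raw.size, n_bits)
    values = chunks.dot(1 << np.arange(n_bits - 1, -1, -1)).astype(np.uint32)
    mask = np.uint32((0xFFFFFFFF >> n_bits) << n_bits)
    raw[:] = (raw & mask) | values                  # splice the secret into the low bits
    return flat.reshape(params.shape)

def extract_lsb(params: np.ndarray, n_secret_bytes: int, n_bits: int = 8) -> bytes:
    """Read the low bits back out of a published or stolen parameter array."""
    raw = params.astype(np.float32).ravel().view(np.uint32)
    low = raw & np.uint32((1 << n_bits) - 1)
    bits = ((low[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1).astype(np.uint8)
    return np.packbits(bits.ravel())[:n_secret_bytes].tobytes()
```

Overwriting the lowest 8 of a float32's 23 mantissa bits changes each parameter by a relative amount on the order of 2^-15, which is why accuracy is essentially unaffected.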
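The correlated value and sign encoding attacks both work by adding an extra term to the training loss. The sketch below is a simplification rather than the paper's exact formulation: the name `malicious_penalty`, the hinge-style sign term, and the lambda weights are assumptions.

```python
import numpy as np

def malicious_penalty(theta: np.ndarray, secret_values: np.ndarray,
                      secret_bits: np.ndarray,
                      lam_corr: float = 1.0, lam_sign: float = 1.0) -> float:
    """Extra loss terms the malicious training code could add:
    - a correlation term that pulls parameters toward the secret values
      (correlated value encoding), and
    - a hinge-style term that penalizes parameters whose sign disagrees with
      the bit they should encode (sign encoding: bit 0 -> negative, bit 1 -> positive)."""
    t = theta.ravel()
    # Correlated value encoding: reward high |Pearson correlation|.
    tv = t[:secret_values.size] - t[:secret_values.size].mean()
    sv = secret_values - secret_values.mean()
    corr = np.dot(tv, sv) / (np.linalg.norm(tv) * np.linalg.norm(sv) + 1e-12)
    # Sign encoding: penalize sign mismatches between parameters and bits.
    targets = 2.0 * secret_bits - 1.0               # map {0, 1} -> {-1, +1}
    sign_term = np.maximum(0.0, -targets * t[:secret_bits.size]).mean()
    return -lam_corr * np.abs(corr) + lam_sign * sign_term
```

Decoding the sign channel is then just reading off `theta > 0`, while decoding the correlation channel maps parameter values back to approximate pixel or token values.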
Black-Box Attacks
The paper also introduces black-box attacks utilizing the vast memorization capacity of modern models:
- Capacity Abuse through Data Augmentation: This method augments the training set with synthetic inputs known only to the attacker, labeled so that the labels encode the sensitive information. The model deliberately overfits these synthetic inputs; querying the deployed model with the same inputs and reading its predictions systematically reveals the memorized data, as sketched below.
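A minimal sketch of the idea, under the assumption of a classifier with `n_classes` output labels (the function names and the pseudo-random input construction are illustrative, not the paper's exact procedure): the attacker deterministically generates synthetic inputs from a shared seed and assigns them labels that spell out the secret; after deployment, regenerating the same inputs and reading the model's predicted labels recovers the bits.

```python
import numpy as np

def make_malicious_augmentation(secret: bytes, n_classes: int,
                                input_shape=(32, 32, 3), seed: int = 0):
    """Encode a secret as (synthetic_inputs, labels) to be appended to the
    training set; each label carries log2(n_classes) bits of the secret."""
    bits = np.unpackbits(np.frombuffer(secret, dtype=np.uint8))
    bits_per_label = int(np.log2(n_classes))
    pad = (-bits.size) % bits_per_label             # pad to a whole number of labels
    bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
    chunks = bits.reshape(-1, bits_per_label)
    labels = chunks.dot(1 << np.arange(bits_per_label - 1, -1, -1))
    rng = np.random.default_rng(seed)               # seed known to the attacker
    inputs = rng.random((labels.size, *input_shape), dtype=np.float32)
    return inputs, labels

def decode_from_predictions(predicted_labels: np.ndarray, n_classes: int,
                            n_secret_bytes: int) -> bytes:
    """After deployment: query the model on the regenerated synthetic inputs
    and turn its predicted labels back into the secret's bits."""
    bits_per_label = int(np.log2(n_classes))
    bits = (predicted_labels[:, None] >> np.arange(bits_per_label - 1, -1, -1)) & 1
    return np.packbits(bits.astype(np.uint8).ravel())[:n_secret_bytes].tobytes()
```

The more synthetic points the training set can absorb without hurting test accuracy, the more bits leak; this is exactly the capacity abuse the paper measures.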
Evaluation and Results
The authors rigorously evaluate their techniques on a suite of standard ML tasks covering image and text datasets. They demonstrate that malicious models exhibit nearly identical predictive performance to conventional models while leaking significant portions of their training data. For example, a model can reveal 70% of its 10,000-document training corpus through a white-box attack without impacting accuracy.
Implications and Future Directions
The paper underscores the critical privacy risks of using third-party ML code and argues for robust auditing mechanisms and protective measures against such covert data extraction. The researchers touch on potential countermeasures, such as parameter perturbation and anomaly detection based on parameter distributions, but these remain areas for further research.
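As one example of what parameter perturbation could look like in practice (a sketch assuming float32 parameters; the bit budget and the helper name are not from the paper), the model owner could randomize the low-order mantissa bits before releasing the model, destroying any LSB-encoded payload while perturbing each value only negligibly.

```python
import numpy as np

def perturb_low_bits(params: np.ndarray, n_bits: int = 12, seed: int = 0) -> np.ndarray:
    """Randomize the n_bits lowest mantissa bits of each float32 parameter,
    wiping out any data hidden there at a negligible cost to precision."""
    rng = np.random.default_rng(seed)
    flat = params.astype(np.float32).ravel().copy()
    raw = flat.view(np.uint32)
    noise = rng.integers(0, 1 << n_bits, size=raw.size, dtype=np.uint32)
    mask = np.uint32((0xFFFFFFFF >> n_bits) << n_bits)
    raw[:] = (raw & mask) | noise
    return flat.reshape(params.shape)
```

Such bit-level perturbation targets only the LSB channel; the correlation, sign, and capacity-abuse attacks are designed to tolerate small parameter changes, which is why defenses remain an open problem.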
The implications extend beyond practical security concerns, prompting reflections on ethical guidelines and technical standards in the deployment of AI systems. Future research is encouraged to formalize a principle of least privilege for ML, ensuring models capture only the necessary information for their tasks without unintended memorization.
This paper serves as a poignant reminder that while ML technologies advance, so too must the scrutiny and safeguards surrounding their deployment on sensitive data.