- The paper challenges the view that public pretraining ensures privacy by highlighting risks of sensitive data leakage.
- It critiques standard benchmarks such as CIFAR-10 and ImageNet for inflating claims of progress in privacy-preserving (DP) learning.
- It recommends developing context-aware systems and secure deployment methods to balance model utility with true privacy.
An Examination of Differentially Private Learning with Large-Scale Public Pretraining
In "Considerations for Differentially Private Learning with Large-Scale Public Pretraining," Tramèr, Kamath, and Carlini critically evaluate the interplay between differential privacy (DP) and transfer learning from large publicly-accessible datasets. The authors express skepticism regarding the community's portrayal of publicly pre-trained models as privacy-preserving, emphasizing potential misalignments between perceived and actual privacy implications.
Core Arguments and Analysis
Privacy Implications of Public Pretraining
A primary concern is the misconception that models pre-trained on public data inherently preserve privacy. The paper argues that data scraped from the web, although publicly accessible, frequently includes sensitive information that was either unintentionally disclosed or not intended for such extensive use. This disconnect raises questions about privacy expectations, especially when these models are characterized as "private." The paper cites instances where even well-curated datasets unintentionally include sensitive data that users might expect to remain private, noting potential harms if such data is memorized by models and subsequently leaked.
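To make the memorization risk concrete, the sketch below illustrates a simple loss-threshold membership inference test in the style of Yeom et al.; the per-example losses here are synthetic stand-ins and the threshold is arbitrary, so this is an illustrative toy rather than the paper's own evaluation.

```python
import numpy as np

def loss_threshold_attack(member_losses, nonmember_losses, threshold):
    """Loss-threshold membership inference (in the style of Yeom et al.):
    an example whose loss falls below the threshold is flagged as a likely
    training-set member, i.e. a candidate for having been memorized."""
    tpr = float(np.mean(member_losses < threshold))     # members caught
    fpr = float(np.mean(nonmember_losses < threshold))  # non-members misflagged
    return tpr, fpr

# Hypothetical per-example losses: memorized training points sit at low loss.
rng = np.random.default_rng(0)
member_losses = rng.exponential(scale=0.1, size=1000)     # seen in training
nonmember_losses = rng.exponential(scale=1.0, size=1000)  # held out
tpr, fpr = loss_threshold_attack(member_losses, nonmember_losses, 0.3)
print(f"attack TPR = {tpr:.2f} at FPR = {fpr:.2f}")
```

A large gap between the true-positive and false-positive rates is evidence that the model treats its training data differently from fresh data, which is exactly the kind of leakage the authors worry about when "public" web data turns out to be sensitive.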
Utility Concerns: Relevance of Current Benchmarks
The utility of the pretraining-then-finetuning paradigm is another focus. The authors argue that benchmarks traditionally used to evaluate machine learning, such as CIFAR-10 or ImageNet, are poor proxies for privacy-sensitive learning. These benchmarks often overlap significantly with common pretraining datasets, inflating perceived progress. Such overlap can create misleading narratives about the paradigm's effectiveness in real-world privacy-preserving scenarios, where the private data distribution may differ substantially from the public pretraining data.
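The overlap concern can be made operational. The sketch below shows one hypothetical way to estimate near-duplication between a pretraining set and a finetuning set via cosine similarity of feature embeddings; the embeddings, threshold, and data are invented for illustration and are not a procedure taken from the paper.

```python
import numpy as np

def near_duplicate_fraction(pretrain_emb, finetune_emb, sim_threshold=0.95):
    """Fraction of finetuning examples that have a near-duplicate in the
    pretraining set, judged by cosine similarity of feature embeddings.
    A high fraction suggests the benchmark overstates transfer gains."""
    a = pretrain_emb / np.linalg.norm(pretrain_emb, axis=1, keepdims=True)
    b = finetune_emb / np.linalg.norm(finetune_emb, axis=1, keepdims=True)
    sims = b @ a.T                      # (n_finetune, n_pretrain) similarities
    return float(np.mean(sims.max(axis=1) >= sim_threshold))

# Hypothetical embeddings standing in for, e.g., features from a vision model.
rng = np.random.default_rng(1)
pre = rng.normal(size=(500, 64))
fin = np.vstack([pre[:50] + 0.01 * rng.normal(size=(50, 64)),  # planted dupes
                 rng.normal(size=(150, 64))])                  # novel examples
print(f"near-duplicate fraction: {near_duplicate_fraction(pre, fin):.2f}")
```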
Deployment and Trust Issues
Models that benefit most from large-scale public pretraining often require substantial computational resources, precluding on-device deployment. Users must therefore entrust their private data to third-party providers at inference time, creating a dependency on potentially insecure cloud services. The paper emphasizes that this external dependency may represent a net loss of privacy, since users bear the risk of data exposure during inference.
Discussion: Implications and Forward-Looking Statements
The authors conclude with a call for increased nuance in evaluating the use of public data in private learning contexts. They stress that data privacy should not be reduced to a binary distinction between public and private datasets, urging the development of more context-aware learning systems. Suggested future directions include:
- Developing models trained on non-sensitive, consent-secured internet data.
- Establishing benchmarks that accurately reflect tasks requiring sensitive data management.
- Investigating efficient model distillation techniques that enable local, on-device deployment without compromising utility (a minimal sketch follows this list).
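On the last point, a minimal sketch of Hinton-style knowledge distillation is given below, assuming synthetic logits; it illustrates how a large publicly pretrained teacher could be compressed into a student small enough for on-device inference. The paper itself does not prescribe a specific distillation recipe.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation: blend cross-entropy on the true
    labels with KL divergence to the temperature-softened teacher outputs."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                                     - np.log(p_student + 1e-12)), axis=1))
    n = len(labels)
    ce = -np.mean(np.log(softmax(student_logits)[np.arange(n), labels] + 1e-12))
    # T^2 rescaling keeps the soft-target gradient magnitude comparable.
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

# Hypothetical logits for a 3-class task.
rng = np.random.default_rng(2)
s, t = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
y = np.array([0, 1, 2, 0])
print(f"loss = {distillation_loss(s, t, y):.3f}")
```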
The paper underlines the critical need for a holistic approach to privacy beyond mere compliance with DP at training time. Privacy should be considered during all stages of model development and deployment, including data collection, model lifecycle management, and user interaction.
Conclusion
This paper presents a thorough critique of the assumptions underpinning current trends in privately finetuning publicly pretrained models under differential privacy guarantees. The considerations it raises are pivotal for ensuring that progress in this field is not merely technological but also consistent with rigorous standards for user privacy. Such insights are essential as the research community seeks to balance utility, privacy, and practical constraints when deploying machine learning models built on vast, publicly sourced datasets. These considerations not only safeguard individual privacy but also strengthen societal trust in privacy-enhancing technologies.