- The paper challenges the view that public pretraining ensures privacy by highlighting risks of sensitive data leakage.
- It critiques standard benchmarks such as CIFAR-10 and ImageNet for inflating claims of progress in privacy-preserving (DP) learning.
- It recommends developing context-aware systems and secure deployment methods to balance model utility with true privacy.
An Examination of Differentially Private Learning with Large-Scale Public Pretraining
In "Considerations for Differentially Private Learning with Large-Scale Public Pretraining," Tramèr, Kamath, and Carlini critically evaluate the interplay between differential privacy (DP) and transfer learning from large publicly-accessible datasets. The authors express skepticism regarding the community's portrayal of publicly pre-trained models as privacy-preserving, emphasizing potential misalignments between perceived and actual privacy implications.
Core Arguments and Analysis
Privacy Implications of Public Pretraining
A primary concern is the misconception that models pre-trained on public data inherently preserve privacy. The paper argues that data scraped from the web, although publicly accessible, frequently includes sensitive information that was either unintentionally disclosed or not intended for such extensive use. This disconnect raises questions about privacy expectations, especially when these models are characterized as "private." The paper cites instances where even well-curated datasets unintentionally include sensitive data that users might expect to remain private, noting potential harms if such data is memorized by models and subsequently leaked.
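To make the memorization risk concrete, the sketch below illustrates a simple loss-threshold membership inference test in the style of Yeom et al.; the per-example losses here are synthetic stand-ins and the threshold is arbitrary, so this is an illustrative toy rather than the paper's own evaluation.

```python
import numpy as np

def loss_threshold_attack(member_losses, nonmember_losses, threshold):
    """Loss-threshold membership inference (in the style of Yeom et al.):
    an example whose loss falls below the threshold is flagged as a likely
    training-set member, i.e. a candidate for having been memorized."""
    tpr = float(np.mean(member_losses < threshold))     # members caught
    fpr = float(np.mean(nonmember_losses < threshold))  # non-members misflagged
    return tpr, fpr

# Hypothetical per-example losses: memorized training points sit at low loss.
rng = np.random.default_rng(0)
member_losses = rng.exponential(scale=0.1, size=1000)     # seen in training
nonmember_losses = rng.exponential(scale=1.0, size=1000)  # held out
tpr, fpr = loss_threshold_attack(member_losses, nonmember_losses, 0.3)
print(f"attack TPR = {tpr:.2f} at FPR = {fpr:.2f}")
```

A large gap between the true-positive and false-positive rates is evidence that the model treats its training data differently from fresh data, which is exactly the kind of leakage the authors worry about when "public" web data turns out to be sensitive.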
Utility Concerns: Relevance of Current Benchmarks
The utility of the pretraining-then-finetuning paradigm is another focus. The authors argue that benchmarks traditionally used to evaluate machine learning, such as CIFAR-10 or ImageNet, are poor proxies for privacy-sensitive learning. These benchmarks often overlap significantly with common pretraining datasets, inflating perceived progress. Such overlap can create misleading narratives about the paradigm's effectiveness in real-world privacy-preserving scenarios, where the private data distribution may differ substantially from the public pretraining data.
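The overlap concern can be made operational. The sketch below shows one hypothetical way to estimate near-duplication between a pretraining set and a finetuning set via cosine similarity of feature embeddings; the embeddings, threshold, and data are invented for illustration and are not a procedure taken from the paper.

```python
import numpy as np

def near_duplicate_fraction(pretrain_emb, finetune_emb, sim_threshold=0.95):
    """Fraction of finetuning examples that have a near-duplicate in the
    pretraining set, judged by cosine similarity of feature embeddings.
    A high fraction suggests the benchmark overstates transfer gains."""
    a = pretrain_emb / np.linalg.norm(pretrain_emb, axis=1, keepdims=True)
    b = finetune_emb / np.linalg.norm(finetune_emb, axis=1, keepdims=True)
    sims = b @ a.T                      # (n_finetune, n_pretrain) similarities
    return float(np.mean(sims.max(axis=1) >= sim_threshold))

# Hypothetical embeddings standing in for, e.g., features from a vision model.
rng = np.random.default_rng(1)
pre = rng.normal(size=(500, 64))
fin = np.vstack([pre[:50] + 0.01 * rng.normal(size=(50, 64)),  # planted dupes
                 rng.normal(size=(150, 64))])                  # novel examples
print(f"near-duplicate fraction: {near_duplicate_fraction(pre, fin):.2f}")
```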
Deployment and Trust Issues
Models that benefit most from large-scale public pretraining often require substantial computational resources, precluding on-device deployment. Users must therefore entrust their private data to third-party providers at inference time, creating a dependency on potentially insecure cloud services. The paper emphasizes that this external dependency may represent a net loss of privacy, since users bear the risk of data exposure during inference.
Discussion: Implications and Forward-Looking Statements
The authors conclude with a call for increased nuance in evaluating the use of public data in private learning contexts. They stress that data privacy should not be reduced to a binary distinction between public and private datasets, urging the development of more context-aware learning systems. Suggested future directions include:
- Developing models trained on non-sensitive, consent-secured internet data.
- Establishing benchmarks that accurately reflect tasks requiring sensitive data management.
- Investigating efficient model distillation techniques that enable local, on-device deployment without compromising utility (a minimal sketch follows this list).
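On the last point, a minimal sketch of Hinton-style knowledge distillation is given below, assuming synthetic logits; it illustrates how a large publicly pretrained teacher could be compressed into a student small enough for on-device inference. The paper itself does not prescribe a specific distillation recipe.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation: blend cross-entropy on the true
    labels with KL divergence to the temperature-softened teacher outputs."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                                     - np.log(p_student + 1e-12)), axis=1))
    n = len(labels)
    ce = -np.mean(np.log(softmax(student_logits)[np.arange(n), labels] + 1e-12))
    # T^2 rescaling keeps the soft-target gradient magnitude comparable.
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

# Hypothetical logits for a 3-class task.
rng = np.random.default_rng(2)
s, t = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
y = np.array([0, 1, 2, 0])
print(f"loss = {distillation_loss(s, t, y):.3f}")
```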
The paper underlines the critical need for a holistic approach to privacy beyond mere compliance with DP at training time. Privacy should be considered during all stages of model development and deployment, including data collection, model lifecycle management, and user interaction.
Conclusion
This paper presents a thorough critique of the assumptions underpinning current trends in privately finetuning publicly pretrained models under differential privacy guarantees. The considerations it raises are pivotal for ensuring that progress in this field is not merely technological but also consistent with rigorous standards for user privacy. Such insights are essential as the research community seeks to balance utility, privacy, and practical constraints when deploying machine learning models built on vast, publicly sourced datasets. These considerations not only safeguard individual privacy but also strengthen societal trust in privacy-enhancing technologies.