Differentially Private Optimization on Large Models at Small Cost
The paper "Differentially Private Optimization on Large Models at Small Cost" introduces an efficient technique for training large neural networks with differential privacy (DP). The primary focus is to address the challenges posed by the high computational overhead associated with DP, particularly in terms of time and memory resources. The authors propose the Book-Keeping (BK) algorithm, which significantly reduces these costs while maintaining state-of-the-art accuracy.
Overview
Differential privacy, as defined by Dwork et al., ensures that the inclusion or exclusion of a single training sample does not significantly affect the output of a model. This property is particularly important when training on sensitive data, as it limits how much the model can leak about any individual sample. Traditional DP methods, such as DP-SGD, incur substantial costs: they require per-sample gradient clipping and noise addition on top of standard training, leading to slowdowns of 2-1000X in time and space complexity, depending on the model size and the implementation.
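For concreteness, the per-sample operations look roughly as follows. This is a minimal, illustrative sketch of the standard DP-SGD aggregation for a single parameter tensor, assuming the per-sample gradients have already been materialized; the function name and signature are illustrative, not taken from the paper. Storing the (B, *param_shape) tensor of per-sample gradients is precisely the memory cost that BK later avoids.

```python
import torch

def dp_sgd_aggregate(per_sample_grads, clip_norm, noise_multiplier):
    """Noisy clipped mean of per-sample gradients for one parameter tensor.

    per_sample_grads: shape (B, *param_shape); materializing this tensor
    explicitly is the main memory overhead of naive DP-SGD.
    """
    B = per_sample_grads.shape[0]
    flat = per_sample_grads.reshape(B, -1)
    # Clip each per-sample gradient to L2 norm <= clip_norm.
    factors = (clip_norm / (flat.norm(dim=1) + 1e-6)).clamp(max=1.0)
    clipped_sum = (flat * factors[:, None]).sum(dim=0)
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noise = torch.randn_like(clipped_sum) * noise_multiplier * clip_norm
    return ((clipped_sum + noise) / B).reshape(per_sample_grads.shape[1:])
```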
Contributions
Algorithmic Efficiency: The core contribution of the work is the BK algorithm, which focuses on minimizing the computational overhead typically caused by per-sample gradient operations. BK achieves this through two main innovations:
- Ghost Norm Trick: This computes per-sample gradient norms directly from each layer's activations and output gradients, without ever instantiating the per-sample gradients themselves (a sketch for a single linear layer appears just after this list). The technique is adapted from prior work on ghost clipping and optimized within BK for efficiency.
- Book-Keeping and Ghost Differentiation Tricks: By book-keeping intermediate results (the output gradients) during the single back-propagation pass and skipping unnecessary gradient computations, BK bypasses the second back-propagation pass required by prior methods such as GhostClip.
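To illustrate the ghost norm idea, here is a small sketch for a single linear layer with output = input @ W: the squared per-sample gradient norm ||a_i^T g_i||_F^2 equals the Frobenius inner product of a_i a_i^T and g_i g_i^T, so it can be computed from the stored activations a and output gradients g without ever forming the (B, d, p) per-sample gradient tensor. The function name is illustrative, not the paper's API.

```python
import torch

def ghost_norm_sq(a, g):
    """Per-sample squared gradient norms for a linear layer's weight.

    a: (B, T, d) layer inputs;  g: (B, T, p) gradients w.r.t. layer outputs.
    The per-sample weight gradient is a_i^T @ g_i (never materialized here);
    its squared Frobenius norm equals <a_i a_i^T, g_i g_i^T>.
    """
    aaT = torch.einsum("bti,bsi->bts", a, a)      # (B, T, T)
    ggT = torch.einsum("btj,bsj->bts", g, g)      # (B, T, T)
    return torch.einsum("bts,bts->b", aaT, ggT)   # (B,)
```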
Together, these optimizations bring BK's training speed and memory footprint close to those of non-DP training, a combination that prior DP implementations could not achieve. The book-kept output gradients are then reweighted by the per-sample clipping factors to produce the clipped gradient sum in a single pass, as sketched below.
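Continuing the linear-layer example, here is a rough sketch of that book-keeping step (again with illustrative names, not the paper's implementation): once the per-sample norms are known, the clipping factors are applied directly to the stored output gradients, and one weighted contraction replaces the second back-propagation pass.

```python
import torch

def clipped_weight_grad(a, g, norm_sq, clip_norm):
    """Sum of clipped per-sample weight gradients, sum_i c_i * a_i^T @ g_i.

    a: (B, T, d) inputs; g: (B, T, p) book-kept output gradients;
    norm_sq: (B,) per-sample squared gradient norms (e.g. from ghost_norm_sq).
    """
    c = (clip_norm / (norm_sq.sqrt() + 1e-6)).clamp(max=1.0)  # (B,) clipping factors
    # One contraction yields the (d, p) clipped gradient sum for a weight W
    # used as output = input @ W, with no second backward pass.
    return torch.einsum("b,bti,btj->ij", c, a, g)
```

Gaussian noise is then added to this clipped sum exactly as in the DP-SGD sketch above.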
Numerical Results and Implications
The authors report strong numerical results: BK trains at nearly the same speed as non-private models, with a memory overhead of less than 1% reported for GPT-2. BK is applied to large models such as GPT-2, ResNets, and vision transformers, showing consistent efficiency improvements over existing DP implementations. The benchmarks cover both NLP and vision tasks, including GPT-2 on the E2E dataset and vision transformers on CIFAR-10/100 and ImageNet.
Practical and Theoretical Implications
The practical impact of this work is immediate and significant. By reducing the cost of DP training, it becomes feasible to train large models on sensitive datasets without prohibitive resource demands. This opens up opportunities for industries dealing with sensitive data—such as healthcare and finance—to utilize state-of-the-art machine learning with robust privacy guarantees.
Theoretically, BK demonstrates that DP constraints do not inherently necessitate substantial computational penalties. This naturally raises the question of where else such efficiency improvements can be realized within DP frameworks, and whether similar strategies apply to other areas of privacy-preserving machine learning.
Future Perspectives
The success of BK sets the stage for further research into optimizing DP efficiency. Potential directions include extending the methodology to layer types beyond generalized linear layers and exploring hybrids in which BK is combined with parameter-efficient fine-tuning techniques such as LoRA and adapters.
Another promising direction is applying BK in federated learning, where the trade-offs between model accuracy, privacy, and resource constraints are even more pronounced because data is processed in a distributed fashion.
Overall, the paper advances the field by demonstrating that efficient DP training for large models is achievable and highly beneficial, providing a foundation for more scalable and privacy-conscious AI applications.