Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger (2206.07136v3)

Published 14 Jun 2022 in cs.LG, cs.CL, cs.CR, and cs.CV

Abstract: Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training for deep learning models. The choice of clipping threshold R, however, is vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune R for any DP optimizers, including DP-SGD, DP-Adam, DP-LAMB and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as the standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, showing that it can enjoy an asymptotic convergence rate that matches the standard SGD, under a symmetric gradient noise assumption of the per-sample gradients (commonly used in the non-DP literature). We demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases.


Summary

  • The paper introduces an automatic clipping method that eliminates manual threshold tuning in DP-SGD, simplifying privacy-preserving training.
  • It provides a rigorous convergence analysis, showing that AUTO-S achieves asymptotic rates comparable to standard SGD in non-convex settings.
  • The study demonstrates that automatic clipping matches or outperforms state-of-the-art methods on tasks such as image classification and NLP, while reducing hyperparameter complexity.

Insights into Differentially Private Deep Learning with Automatic Clipping

The paper Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger introduces a novel approach for differentially private (DP) learning by simplifying the gradient clipping process. Authors Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis propose an automatic clipping method that eliminates the need for tuning the clipping threshold R in differential privacy optimizers.

Key Contributions

The paper's primary contribution is the introduction of automatic clipping mechanisms, specifically AUTO-V and AUTO-S, designed to replace the traditional per-example gradient clipping employed in DP training methods like DP-SGD. Notably, this approach does away with the cumbersome hyperparameter tuning traditionally associated with DP methods, making DP training nearly as accessible as non-private learning.

  1. Automatic Clipping: The proposed method removes the need to choose the clipping threshold R, a hyperparameter that strongly affects training accuracy under DP. Two variants, AUTO-V (vanilla clipping) and AUTO-S (clipping with stability), are introduced. AUTO-S incorporates a stability constant that preserves the magnitude information of small gradients and helps models converge to stationary points (a minimal implementation sketch follows this list).
  2. Convergence Analysis: The authors provide a rigorous convergence analysis of automatic DP-SGD in the non-convex setting. AUTO-S, in particular, matches the asymptotic convergence rate of standard SGD, which implies that DP-SGD with automatic clipping can drive the gradient norm to zero, unlike its traditionally clipped counterpart.
  3. Numerical Results: The paper demonstrates that automatic clipping performs on par with or better than state-of-the-art methods on a range of machine learning tasks, including image classification and natural language processing, while substantially reducing the effort spent on hyperparameter tuning, a significant advantage when scaling DP to large datasets and models.
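The core idea can be condensed into a few lines. Below is a minimal sketch, not the authors' released implementation, of one DP-SGD step with AUTO-S clipping in plain PyTorch; the toy model, data, learning rate lr, noise multiplier sigma, and stability constant gamma are illustrative assumptions.

```python
# Minimal sketch of one DP-SGD step with automatic (AUTO-S) clipping.
# Model, data, lr, sigma, and gamma are illustrative assumptions.
import torch

def dp_sgd_step_auto_s(model, loss_fn, xs, ys, lr=0.1, sigma=1.0, gamma=0.01):
    """One DP-SGD step where each per-sample gradient g_i is rescaled by
    1 / (||g_i|| + gamma) instead of the usual min(1, R / ||g_i||)."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(xs, ys):                          # per-example gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = 1.0 / (norm + gamma)                  # AUTO-S; gamma = 0 gives AUTO-V
        for s, g in zip(summed, grads):
            s.add_(scale * g)

    with torch.no_grad():
        for p, s in zip(params, summed):
            # Each rescaled per-sample gradient has norm at most 1, so Gaussian
            # noise with std sigma matches the calibration for a threshold R = 1.
            p.add_(-lr * (s + sigma * torch.randn_like(s)) / len(xs))

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Linear(5, 1)
    xs, ys = torch.randn(8, 5), torch.randn(8, 1)
    dp_sgd_step_auto_s(model, torch.nn.functional.mse_loss, xs, ys)
```

In practice one would compute per-sample gradients with vectorized tooling (e.g., an Opacus-style per-sample gradient engine) rather than a Python loop, and convert the noise multiplier sigma into an (epsilon, delta) guarantee with a privacy accountant; the point of the sketch is only that no clipping threshold R appears anywhere.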

Theoretical and Practical Implications

  1. Simplified Hyperparameter Tuning: The automatic clipping approach makes DP training more straightforward by reducing the dimensionality of the hyperparameter search. This is especially beneficial for large models, where tuning is computationally expensive and time-consuming.
  2. Asymptotic Efficiency: With AUTO-S, the authors establish an asymptotic convergence rate comparable to non-DP SGD, narrowing the gap between DP and non-DP training in practical applications.
  3. Model Robustness: By addressing the "lazy region" issue through the stability constant in AUTO-S, the paper improves the stability and robustness of gradient descent under DP optimization (the formulas following this list make the mechanism concrete).
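As an editorial restatement (the notation below is not copied from the paper), the per-sample re-weighting rules compared above can be written as:

```latex
% g_i: per-sample gradient, R: clipping threshold, \gamma > 0: stability constant
C_i^{\mathrm{Abadi}} = \min\!\left(1, \frac{R}{\lVert g_i \rVert}\right), \qquad
C_i^{\mathrm{AUTO\text{-}V}} = \frac{1}{\lVert g_i \rVert}, \qquad
C_i^{\mathrm{AUTO\text{-}S}} = \frac{1}{\lVert g_i \rVert + \gamma}.
```

Since the rescaled AUTO-S gradient always has norm strictly below 1, the Gaussian noise can be calibrated exactly as for a fixed threshold of 1, so no DP-specific threshold remains to tune; and for small gradients the update is approximately g_i divided by the stability constant, which retains magnitude information and is what lets AUTO-S escape the lazy region.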

Future Directions

The work opens several avenues for future research. One promising path is combining automatic clipping with other optimizers and architectural adaptations, such as LoRA or prefix-tuning in transformers, potentially improving performance further. Additionally, more adaptive methods for choosing parameters such as the stability constant could further reduce the need for manual tuning and improve the efficiency of DP training.

Conclusion

The authors have presented a compelling case for automatic clipping in differentially private deep learning, providing both theoretical insights and practical benefits. This approach streamlines the training of large models in privacy-sensitive applications, fostering broader adoption of DP techniques without sacrificing usability or performance. As such, this contribution marks a significant step towards making privacy-preserving deep learning more accessible and efficient for a wider range of applications.
