Efficient Language Model Architectures for Differentially Private Federated Learning (2403.08100v1)
Abstract: Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without the data ever leaving the devices. SGD is the standard client optimizer for on-device training in cross-device FL, favored for its memory and computational efficiency. However, in centralized training of neural LLMs, adaptive optimizers are preferred because they offer improved stability and performance. In light of this, we ask whether LLMs can be modified so that they can be trained efficiently with the SGD client optimizer, and answer this affirmatively. We propose a scale-invariant Coupled Input Forget Gate (SI CIFG) recurrent network, obtained by modifying the sigmoid and tanh activations in the recurrent cell, and show in large-scale experiments that this new model converges faster and achieves better utility than the standard CIFG recurrent model in cross-device FL. We further show that the proposed scale-invariant modification also helps in federated learning of larger transformer models. Finally, we demonstrate that the scale-invariant modification is compatible with other non-adaptive algorithms. In particular, our results suggest an improved privacy-utility trade-off in federated learning with differential privacy.
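To make the architectural idea concrete, here is a minimal NumPy sketch of a CIFG cell whose gate activations are made scale-invariant by normalizing each pre-activation block by its L2 norm. This is an illustration under stated assumptions, not the paper's exact formulation: the names `si_sigmoid`, `si_tanh`, and `cifg_step`, the `eps` constant, and the choice of which activations are normalized are all hypothetical.

```python
import numpy as np

def si_sigmoid(x, eps=1e-6):
    """Sigmoid on an L2-normalized pre-activation (assumed form of scale invariance)."""
    return 1.0 / (1.0 + np.exp(-x / (np.linalg.norm(x) + eps)))

def si_tanh(x, eps=1e-6):
    """Tanh on an L2-normalized pre-activation (assumed form of scale invariance)."""
    return np.tanh(x / (np.linalg.norm(x) + eps))

def cifg_step(x, h, c, W, U, b):
    """One step of a CIFG cell: the forget gate is coupled as 1 - input gate,
    so only three gate blocks (input, candidate, output) are computed."""
    d = h.shape[0]
    z = W @ x + U @ h + b          # stacked pre-activations, shape (3*d,)
    i = si_sigmoid(z[:d])          # input gate
    f = 1.0 - i                    # coupled forget gate (the CIFG coupling)
    g = si_tanh(z[d:2*d])          # candidate cell update
    o = si_sigmoid(z[2*d:])        # output gate
    c_new = f * c + i * g
    h_new = o * si_tanh(c_new)     # assumption: the output tanh is also scale-invariant
    return h_new, c_new

# Toy usage: hidden size 4, input size 3.
rng = np.random.default_rng(0)
d, n = 4, 3
W, U, b = rng.normal(size=(3 * d, n)), rng.normal(size=(3 * d, d)), np.zeros(3 * d)
h, c = np.zeros(d), np.zeros(d)
h, c = cifg_step(rng.normal(size=n), h, c, W, U, b)
```

The point of this construction: dividing a pre-activation by its own norm makes the gate outputs invariant to positive rescaling of the incoming weights, which is the property that lets plain SGD remain stable where one would otherwise reach for an adaptive optimizer.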