
Analysis of Privacy Leakage in Federated Large Language Models (2403.04784v1)

Published 2 Mar 2024 in cs.CR and cs.LG

Abstract: With the rapid adoption of Federated Learning (FL) as the training and tuning protocol for applications built on LLMs, recent research highlights the need for significant modifications to FL to accommodate the large scale of LLMs. While substantial adjustments to the protocol have been introduced in response, a comprehensive privacy analysis of the adapted FL protocol is currently lacking. To address this gap, our work conducts an extensive privacy analysis of FL when used for training LLMs, from both theoretical and practical perspectives. In particular, we design two active membership inference attacks with guaranteed theoretical success rates to assess the privacy leakage of various adapted FL configurations. Our theoretical findings are translated into practical attacks, revealing substantial privacy vulnerabilities in popular LLMs, including BERT, RoBERTa, DistilBERT, and OpenAI's GPTs, across multiple real-world language datasets. Additionally, we conduct thorough experiments to evaluate the privacy leakage of these models when data is protected by state-of-the-art differential privacy (DP) mechanisms.
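
The attacks studied in the paper are membership inference attacks: an adversary decides whether a particular record was part of a client's training data. The sketch below illustrates only the simplest, passive form of this idea, a loss-threshold test, not the paper's active attacks; the threshold value, function names, and toy numbers are illustrative assumptions.

    # Minimal sketch of a passive, loss-threshold membership inference test.
    # Assumption: models tend to assign lower loss to training members than
    # to non-members. This is NOT the active attack proposed in the paper.
    import numpy as np

    def per_example_loss(logits: np.ndarray, label: int) -> float:
        """Cross-entropy loss of one example under the attacked model."""
        z = logits - logits.max()                 # shift for numerical stability
        log_probs = z - np.log(np.exp(z).sum())   # log-softmax
        return -log_probs[label]

    def infer_membership(losses: np.ndarray, threshold: float) -> np.ndarray:
        """Predict 'member' (True) when the loss falls below the threshold.

        In practice the threshold would be calibrated on data known to lie
        outside the training set (e.g., a shadow or held-out dataset).
        """
        return losses < threshold

    # Toy usage: members tend to have lower loss than non-members.
    member_losses = np.array([0.12, 0.30, 0.08])
    nonmember_losses = np.array([1.40, 2.10, 0.90])
    threshold = 0.5  # assumed calibration result
    print(infer_membership(member_losses, threshold))     # [ True  True  True]
    print(infer_membership(nonmember_losses, threshold))  # [False False False]

An active attacker, as considered in the paper, goes further by manipulating the model or protocol messages it sends to clients rather than merely observing losses.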

Authors (4)
  1. Minh N. Vu (12 papers)
  2. Truc Nguyen (18 papers)
  3. Tre' R. Jeter (3 papers)
  4. My T. Thai (71 papers)
Citations (3)

