PrivateLoRA For Efficient Privacy Preserving LLM (2311.14030v1)

Published 23 Nov 2023 in cs.AI and cs.CR

Abstract: End users face a choice between privacy and efficiency in current LLM service paradigms. In cloud-based paradigms, users are forced to compromise data locality for generation quality and processing speed. Conversely, edge device paradigms maintain data locality but fail to deliver satisfactory performance. In this work, we propose a novel LLM service paradigm that distributes privacy-sensitive computation on edge devices and shared computation in the cloud. Only activations are transmitted between the central cloud and edge devices to ensure data locality. Our core innovation, PrivateLoRA, addresses the challenging communication overhead by exploiting the low rank of residual activations, achieving over 95% communication reduction. Consequently, PrivateLoRA effectively maintains data locality and is extremely resource efficient. Under standard 5G networks, PrivateLoRA achieves throughput over 300% of device-only solutions for 7B models and over 80% of an A100 GPU for 33B models. PrivateLoRA also provides tuning performance comparable to LoRA for advanced personalization. Our approach democratizes access to state-of-the-art generative AI for edge devices, paving the way for more tailored LLM experiences for the general public. To our knowledge, our proposed framework is the first efficient and privacy-preserving LLM solution in the literature.
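As a concrete illustration of the split the abstract describes, below is a minimal PyTorch-style sketch of a single adapted projection: the heavy frozen weights stay on a simulated cloud, a small trainable rank-r adapter stays on the edge device, and only rank-r activations cross the cloud-device boundary. The specific A/M/B decomposition, class names, and dimensions are illustrative assumptions, not the authors' verified implementation.

```python
# Sketch only: LoRA-style low-rank path split between a simulated cloud and
# edge device, so that only rank-r activations cross the network boundary.
import torch
import torch.nn as nn

d_model, rank = 4096, 16  # hypothetical hidden size and adapter rank


class CloudSide(nn.Module):
    """Frozen shared weights: base projection W plus low-rank maps A and B."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)  # heavy base weight, stays in the cloud
        self.A = nn.Linear(d, r, bias=False)  # down-projection to rank r
        self.B = nn.Linear(r, d, bias=False)  # up-projection back to d
        for p in self.parameters():
            p.requires_grad_(False)           # shared weights are not personalized

    def down(self, x: torch.Tensor) -> torch.Tensor:
        # Only this r-dimensional tensor is sent to the device.
        return self.A(x)

    def combine(self, x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # Receive the r-dimensional response and add the low-rank residual.
        return self.W(x) + self.B(m)


class EdgeSide(nn.Module):
    """Private, trainable r-by-r adapter that never leaves the device."""

    def __init__(self, r: int):
        super().__init__()
        self.M = nn.Linear(r, r, bias=False)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        return self.M(a)


cloud, edge = CloudSide(d_model, rank), EdgeSide(rank)
x = torch.randn(1, 8, d_model)   # hidden states already on the cloud side
a = cloud.down(x)                # cloud -> device: (1, 8, rank) values per token
m = edge(a)                      # private computation on the device
y = cloud.combine(x, m)          # device -> cloud: (1, 8, rank) values per token

# Per token, 2*rank values cross the network instead of 2*d_model for a naive
# split, i.e. roughly a 1 - rank/d_model reduction in this toy setting.
print(y.shape, f"reduction ≈ {1 - rank / d_model:.1%}")
```

In this toy setting the reduction is about 99.6%, consistent in spirit with the over-95% communication reduction the abstract reports, though the paper's actual accounting depends on its exact architecture and serialization.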

Authors (4)
  1. Yiming Wang (141 papers)
  2. Yu Lin (50 papers)
  3. Xiaodong Zeng (10 papers)
  4. Guannan Zhang (85 papers)
Citations (8)