
PUMA: Secure Inference of LLaMA-7B in Five Minutes (2307.12533v3)

Published 24 Jul 2023 in cs.CR

Abstract: With ChatGPT as a representative example, many companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the model provider. Previous studies have explored secure inference for Transformer models using secure multiparty computation (MPC), where both the model parameters and the clients' prompts are kept secret. Despite this, existing frameworks remain limited in terms of model performance, efficiency, and deployment. To address these limitations, we propose the framework PUMA to enable fast and secure Transformer model inference. Our framework designs high-quality approximations for expensive functions such as GeLU and softmax, significantly reducing the cost of secure inference while preserving model performance. Additionally, we design secure Embedding and LayerNorm procedures that faithfully implement the desired functionality without undermining the Transformer architecture. PUMA is about $2\times$ faster than the state-of-the-art framework MPCFORMER (ICLR 2023) and achieves accuracy similar to that of plaintext models without fine-tuning (which previous works failed to achieve). PUMA can even evaluate LLaMA-7B in around 5 minutes to generate one token. To the best of our knowledge, this is the first time a model of this size has been evaluated under MPC. PUMA has been open-sourced in the GitHub repository of SecretFlow-SPU.
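The efficiency idea behind the abstract's "high-quality approximations" is worth unpacking: MPC protocols evaluate additions and multiplications on secret-shared values cheaply, while transcendental functions such as erf (inside GeLU) and exp (inside softmax) need expensive iterative sub-protocols, so MPC frameworks replace them with low-degree piecewise polynomials. The sketch below illustrates this idea in plaintext NumPy; the interval [-4, 4], the degree-6 fit, and the breakpoints are illustrative assumptions, not PUMA's actual coefficients or protocol (a real deployment would hard-code the fitted coefficients and run the comparisons and polynomial evaluation on secret shares).

```python
# Illustrative sketch of an MPC-friendly piecewise-polynomial GeLU, in the
# spirit of (but NOT identical to) PUMA's approximation. Outside a bounded
# interval, GeLU is nearly 0 or nearly x, so only the middle segment needs
# a polynomial; inside it, evaluation uses only additions/multiplications,
# which are the cheap operations under MPC.
import math
import numpy as np

def gelu_exact(x: np.ndarray) -> np.ndarray:
    """Reference GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_poly(x: np.ndarray) -> np.ndarray:
    """Hypothetical piecewise approximation:
         x <= -4     -> 0   (GeLU is ~0 here)
         x >=  4     -> x   (GeLU is ~x here)
         -4 < x < 4  -> degree-6 polynomial fitted to GeLU on [-4, 4].
    The fit is recomputed here for self-containment; a real MPC deployment
    would hard-code the coefficients."""
    grid = np.linspace(-4.0, 4.0, 2001)
    coeffs = np.polyfit(grid, gelu_exact(grid), deg=6)
    mid = np.polyval(coeffs, x)
    return np.where(x <= -4.0, 0.0, np.where(x >= 4.0, x, mid))

xs = np.linspace(-6.0, 6.0, 1001)
err = np.max(np.abs(gelu_exact(xs) - gelu_poly(xs)))
print(f"max |exact - approx| on [-6, 6]: {err:.4f}")
```

Running this prints the maximum deviation between the exact and approximated GeLU, making the accuracy/cost trade-off concrete: a small, bounded approximation error in the activation is the price paid for avoiding a secure erf protocol.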

References (47)
  1. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp.  308–318, 2016.
  2. Privformer: Privacy-preserving transformer with mpc. In 2023 IEEE 8th European Symposium on Security and Privacy (EuroSP), pp.  392–410, Los Alamitos, CA, USA, 2023. IEEE Computer Society. doi:10.1109/EuroSP57164.2023.00031. URL https://doi.ieeecomputersociety.org/10.1109/EuroSP57164.2023.00031.
  3. High-throughput semi-honest secure three-party computation with an honest majority. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp.  805–817, 2016.
  4. Language models are few-shot learners, 2020.
  5. Flash: Fast and robust framework for privacy-preserving machine learning. Proc. Priv. Enhancing Technol., 2020(2):459–480, 2020.
  6. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  12299–12310, 2021.
  7. Fantastic four: Honest-majority four-party secure computation with malicious security. In 30th USENIX Security Symposium (USENIX Security 21), 2021.
  8. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019.
  9. Bootstrapped masked autoencoders for vision bert pretraining. In European Conference on Computer Vision, pp.  247–264. Springer, 2022.
  10. Meteor: Improved secure 3-party neural network inference with reducing online communication costs. In Proceedings of the ACM Web Conference 2023, WWW ’23, pp.  2087–2098, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394161.
  11. How to play any mental game. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC ’87, pp.  218–229, New York, NY, USA, 1987. Association for Computing Machinery. ISBN 0897912217.
  12. Iron: Private inference on transformers. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=deyqjpcTfsG.
  13. Cheetah: Lean and fast secure Two-Party deep neural network inference. In 31st USENIX Security Symposium (USENIX Security 22), pp.  809–826, Boston, MA, August 2022. USENIX Association. ISBN 978-1-939133-31-1.
  14. Marcel Keller. Mp-spdz: A versatile framework for multi-party computation. In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, pp.  1575–1590, 2020.
  15. Crypten: Secure multi-party computation meets machine learning. arXiv preprint arXiv:2109.00984, 2021.
  16. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022.
  17. Cryptflow: Secure tensorflow inference. arXiv preprint arXiv:1909.07814, 2019.
  18. MPCFormer: Fast, performant and private transformer inference with MPC. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CWmvjOEhgH-.
  19. Merge: Fast private text generation, 2023.
  20. Oblivious neural network predictions via minionn transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp.  619–631, 2017.
  21. Llms can understand encrypted prompt: Towards privacy-computing friendly transformers, 2023.
  22. Faster secure multiparty computation of adaptive gradient descent. In Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice, PPMLP’20, pp.  47–49, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450380881.
  23. SecretFlow-SPU: A performant and User-Friendly framework for Privacy-Preserving machine learning. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pp.  17–33, Boston, MA, July 2023. USENIX Association. ISBN 978-1-939133-35-9. URL https://www.usenix.org/conference/atc23/presentation/ma.
  24. Pointer sentinel mixture models, 2016.
  25. Delphi: A cryptographic inference service for neural networks. In 29th USENIX Security Symposium (USENIX Security 20), pp. 2505–2522, 2020.
  26. Aby3: A mixed protocol framework for machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp.  35–52, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356930. doi:10.1145/3243734.3243760. URL https://doi.org/10.1145/3243734.3243760.
  27. Secureml: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pp.  19–38. IEEE, 2017.
  28. Blaze: blazing fast privacy-preserving machine learning. arXiv preprint arXiv:2005.09042, 2020.
  29. ABY2.0: Improved mixed-protocol secure two-party computation. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2165–2182, 2021.
  30. Improving language understanding by generative pre-training. 2018.
  31. Cryptflow2: Practical 2-party secure inference. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450370899. URL https://doi.org/10.1145/3372297.3417274.
  32. Sirnn: A math library for secure rnn inference. arXiv preprint arXiv:2105.04236, 2021.
  33. Adi Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, 1979.
  34. Deep learning inference service at microsoft. In 2019 USENIX Conference on Operational Machine Learning (OpML 19), pp.  15–17, 2019.
  35. Cryptgpu: Fast privacy-preserving machine learning on the gpu. arXiv preprint arXiv:2104.10949, 2021.
  36. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  37. Wouter van Oortmerssen. Flatbuffers: a memory efficient serialization library. Web page, androiddevelopers.googleblog.com/2014/06/flatbuffers-memory-efficient.html, 2014.
  38. Kenton Varda. Protocol buffers: Google's data interchange format. Google Open Source Blog, July 2008.
  39. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  40. Securenn: 3-party secure computation for neural network training. Proceedings on Privacy Enhancing Technologies, 2019(3):26–49, 2019.
  41. Falcon: Honest-majority maliciously secure framework for private deep learning. arXiv preprint arXiv:2004.02229, 2020.
  42. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7.
  43. Characterization of mpc-based private inference for transformer-based models. In 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.  187–197, 2022. doi:10.1109/ISPASS55109.2022.00025.
  44. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  45. Xlnet: Generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
  46. Andrew Chi-Chih Yao. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science (sfcs 1986), pp.  162–167. IEEE, 1986.
  47. Kaleido-bert: Vision-language pre-training on fashion domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12647–12657, 2021.
Authors (10)
  1. Ye Dong (10 papers)
  2. Wen-jie Lu (6 papers)
  3. Yancheng Zheng (3 papers)
  4. Haoqi Wu (7 papers)
  5. Derun Zhao (4 papers)
  6. Jin Tan (32 papers)
  7. Zhicong Huang (8 papers)
  8. Cheng Hong (10 papers)
  9. Tao Wei (34 papers)
  10. Wenguang Chen (21 papers)
Citations (37)
