Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference (2405.17485v2)
Abstract: The prevalent use of Transformer-based models, exemplified by ChatGPT in modern language processing applications, underscores the critical need for private inference in the many cloud-based services that rely on such models. However, current privacy-preserving frameworks impose a significant communication burden, especially for the non-linear computations in Transformer models. In this paper, we introduce Comet, a novel plug-in method that effectively reduces the communication cost without compromising inference performance. We further introduce an efficient approximation method that eliminates the heavy communication incurred in finding a good initial approximation. We evaluate Comet on the BERT and RoBERTa models with the GLUE benchmark datasets, showing up to 3.9$\times$ less communication and 3.5$\times$ speedups while maintaining competitive model performance compared to the prior art.
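The abstract's emphasis on finding a good initial approximation echoes a classic plaintext technique from the iterative-methods literature cited below (Goldschmidt division, Lomont's fast inverse square root): a cheap bit-level initial guess lets a few Newton-Raphson steps converge quickly. The sketch below illustrates that general idea only — it is not Comet's protocol, and the magic constant and iteration count are standard textbook choices, not values from the paper:

```python
import struct

def initial_inv_sqrt(x: float) -> float:
    # Bit-level initial guess for 1/sqrt(x) (Lomont-style magic constant):
    # reinterpret the float32 bits as an integer, shift, and subtract.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F375A86 - (i >> 1)
    return struct.unpack('<f', struct.pack('<I', i))[0]

def inv_sqrt(x: float, iters: int = 2) -> float:
    # Refine the initial guess with Newton-Raphson iterations:
    #   y_{n+1} = y_n * (1.5 - 0.5 * x * y_n^2)
    # A good starting point means very few iterations are needed --
    # the property that makes initial approximations valuable when each
    # iteration costs a round of communication in an MPC setting.
    y = initial_inv_sqrt(x)
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

With the bit-trick starting point, two iterations already reach roughly single-precision accuracy; a naive constant initial guess would need many more rounds.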
- EMP-toolkit. https://github.com/emp-toolkit.
- SMU: smooth activation function for deep networks using smoothing maximum technique. In 2022 IEEE CVPR.
- Zvika Brakerski. Fully homomorphic encryption without modulus switching from classical GapSVP. In Proc. CRYPTO, 2012.
- All-or-nothing disclosure of secrets. In Proc. CRYPTO, 1987.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- FLUTE: fast and secure lookup table evaluations. In 2023 IEEE Symposium on Security and Privacy (SP), pages 515–533.
- Homomorphic encryption for arithmetic of approximate numbers. In Proc. ASIACRYPT, 2017.
- Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds. In ASIACRYPT, pages 3–33, 2016.
- Faster cryptonets: Leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:1811.09953, 2018.
- ABY-A framework for efficient mixed-protocol secure two-party computation. In Proc. NDSS, 2015.
- Improving goldschmidt division, square root, and square root reciprocal. IEEE Transactions on Computers, 49(7):759–763, 2000.
- Somewhat practical fully homomorphic encryption. Cryptology ePrint Archive, 2012.
- NFGen: Automatic non-linear function evaluation code generator for general-purpose MPC platforms. In Proc. of the 2022 ACM SIGSAC CCS, pages 995–1008.
- Iron: Private inference on transformers. Advances in neural information processing systems, 35:15718–15731, 2022.
- Cheetah: Lean and fast secure two-party deep neural network inference. In USENIX Security, pages 809–826, 2022.
- Efficient initial approximation for multiplicative division and square root by a multiplication with operand modification. IEEE Transactions on Computers, 46(4):495–498, 1997.
- Wendy James and P Jarratt. The generation of square roots on a computer with rapid multiplication compared with division. Mathematics of Computation, 19(91):497–500, 1965.
- GAZELLE: A low latency framework for secure neural network inference. In Proc. USENIX Security, 2018.
- William Kahan. IEEE standard 754 for binary floating-point arithmetic. Lecture Notes on the Status of IEEE, 754(94720-1776):11, 1996.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
- CrypTen: Secure multi-party computation meets machine learning. Advances in Neural Information Processing Systems, 34:4961–4973, 2021.
- MPCFormer: fast, performant and private Transformer inference with MPC. 2022.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Chris Lomont. Fast inverse square root. Technical report, 2003.
- SecFormer: Towards fast and accurate privacy-preserving inference for large language models. arXiv preprint arXiv:2401.00793, 2024.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Delphi: a cryptographic inference system for neural networks. In Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice, pages 27–30, 2020.
- SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38.
- Some properties of iterative square-rooting methods using high-speed multiplication. IEEE Transactions on Computers, 100(8):837–847, 1972.
- SIRNN: A math library for secure RNN inference. In 2021 IEEE Symposium on Security and Privacy (SP), pages 1003–1020.
- CrypTFlow2: Practical 2-Party Secure Inference. In Proc. ACM CCS, 2020.
- XONN: XNOR-based oblivious deep neural network inference. In Proc. USENIX Security, SEC’19, USA, 2019. USENIX Association.
- High-speed inverse square roots. In Proceedings 14th IEEE Symposium on Computer Arithmetic, pages 124–131. IEEE, 1999.
- Adi Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, 1979.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop, pages 353–355.
- Reluplex made more practical: Leaky ReLU. In 2020 IEEE Symposium on Computers and Communications (ISCC), pages 1–7.
- GALA: Greedy ComputAtion for Linear Algebra in Privacy-Preserved Neural Networks. In Proc. NDSS, 2021.