FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference (2405.16241v1)
Abstract: With the fast evolution of LLMs, privacy concerns over user queries arise, as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query must be formulated as an HE-based matrix-vector multiplication problem and suffers from enormous computation and communication overhead. We observe that the overhead mainly stems from neglecting 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low-bit-width quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., Cheetah, Iron, and Bumblebee, FastQuery achieves more than $4.3\times$, $2.7\times$, and $1.3\times$ latency reduction, respectively, and more than $75.7\times$, $60.2\times$, and $20.2\times$ communication reduction, respectively, on both LLAMA-7B and LLAMA-30B.
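The abstract's two observations can be illustrated with a minimal NumPy sketch in plaintext (this is not the HE protocol itself; the table sizes and the 4-bit setting are illustrative assumptions, not FastQuery's actual parameters):

```python
import numpy as np

# Toy embedding table: vocab_size x hidden_dim (sizes are illustrative).
rng = np.random.default_rng(0)
vocab_size, hidden_dim = 16, 8
table = rng.normal(size=(vocab_size, hidden_dim)).astype(np.float32)

# Observation 1: a user query token is one-hot, so the "matrix-vector
# multiplication" view of an embedding lookup touches only one table row.
# A dense HE matvec that ignores this wastes computation and communication.
token_id = 5
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[token_id] = 1.0
lookup_via_matvec = one_hot @ table  # equals table[token_id]

# Observation 2: the embedding table tolerates low-bit-width quantization.
# A simple symmetric 4-bit uniform quantizer keeps the lookup result close
# to the full-precision row, so low-bit ciphertexts can carry the table.
bits = 4
scale = np.abs(table).max() / (2 ** (bits - 1) - 1)
q_table = np.clip(np.round(table / scale),
                  -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
dequantized = q_table * scale
max_err = np.abs(dequantized[token_id] - table[token_id]).max()
# Per-element rounding error is bounded by scale / 2.
```

The one-hot structure is what the dense packing algorithm exploits, and the small rounding error is what the communication-aware quantization relies on; the sketch only demonstrates the two plaintext facts, not the ciphertext packing itself.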
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- EzPC: Programmable, efficient, and scalable secure two-party computation for machine learning. Cryptology ePrint Archive, 2017.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- Characterizing and optimizing end-to-end systems for private inference. In ACM ASPLOS, pages 89–104, 2023.
- Iron: Private inference on transformers. In Advances in Neural Information Processing Systems, 2022.
- CipherGPT: Secure two-party GPT inference. Cryptology ePrint Archive, 2023.
- Cheetah: Lean and fast secure two-party deep neural network inference. In USENIX Security, pages 809–826, 2022.
- SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.
- AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
- LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023.
- BumbleBee: Secure two-party inference framework for large transformers. Cryptology ePrint Archive, 2023.
- SecretFlow-SPU: A performant and User-Friendly framework for Privacy-Preserving machine learning. In USENIX ATC, pages 17–33, Boston, MA, July 2023. USENIX Association.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Delphi: A cryptographic inference service for neural networks, Jan 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- SiRNN: A math library for secure RNN inference. In 2021 IEEE Symposium on Security and Privacy (SP), pages 1003–1020. IEEE, 2021.
- CrypTFlow2: Practical 2-party secure inference. In ACM SIGSAC CCS, 2020.
- CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product. In DAC. IEEE, 2023.
- Microsoft SEAL (release 3.6). https://github.com/Microsoft/SEAL, November 2020. Microsoft Research, Redmond, WA.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- HEQuant: Marrying homomorphic encryption and quantization for communication-efficient private inference. arXiv preprint arXiv:2401.15970, 2024.
- Falcon: Accelerating homomorphically encrypted convolutions for efficient private mobile network inference. arXiv preprint arXiv:2308.13189, 2023.
- ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. NeurIPS, 35:27168–27183, 2022.