
SentinelLMs: Encrypted Input Adaptation and Fine-tuning of Language Models for Private and Secure Inference (2312.17342v1)

Published 28 Dec 2023 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: This paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern AI-based applications. These models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. However, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. To address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. The original pre-trained model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. This enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. After adaptation, models are fine-tuned on encrypted versions of existing training datasets. Experimental evaluation employing adapted versions of renowned models (e.g., BERT, RoBERTa) across established benchmark English and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. This serves to safeguard performance, privacy, and security cohesively.
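The adaptation step the abstract describes, applying irreversible transformations to the tokenizer and token embeddings so the model runs on encrypted token ids, can be sketched as follows. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: it assumes a keyed BLAKE2 hash is used to derive a collision-free permutation of token ids, and that the embedding table is permuted to match. The function names (`keyed_permutation`, `adapt_embeddings`) and the toy vocabulary are hypothetical.

```python
# Illustrative sketch (assumed, not the paper's code): derive a passkey-dependent
# permutation of token ids via a keyed BLAKE2 hash, then move each embedding row
# to its new ("encrypted") id so inference can run on transformed inputs.
import hashlib

import numpy as np


def keyed_permutation(vocab: list[str], passkey: bytes) -> list[int]:
    """Map old token id -> new token id, ordered by keyed BLAKE2 digests.

    Sorting by digest gives a collision-free permutation; without the
    passkey, the mapping cannot be reconstructed from the tokens alone.
    """
    order = sorted(
        range(len(vocab)),
        key=lambda i: hashlib.blake2b(
            vocab[i].encode("utf-8"), key=passkey, digest_size=16
        ).digest(),
    )
    new_id = [0] * len(vocab)
    for pos, old in enumerate(order):
        new_id[old] = pos
    return new_id


def adapt_embeddings(embeddings: np.ndarray, new_id: list[int]) -> np.ndarray:
    """Permute embedding rows so row new_id[i] holds the vector of old token i."""
    out = np.empty_like(embeddings)
    for old, new in enumerate(new_id):
        out[new] = embeddings[old]
    return out


# Toy example: 3-token vocabulary with 2-dimensional embeddings.
vocab = ["hello", "world", "secret"]
perm = keyed_permutation(vocab, passkey=b"user-passkey")
emb = np.arange(6, dtype=float).reshape(3, 2)
enc_emb = adapt_embeddings(emb, perm)
# At inference time, text is tokenized, ids are remapped through `perm`,
# and the permuted table is looked up -- raw ids never leave the client.
```

After this adaptation, fine-tuning would proceed as usual but on datasets whose tokens have been remapped with the same passkey, which is consistent with the abstract's description of fine-tuning on encrypted versions of existing training data.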

Authors (3)
  1. Abhijit Mishra
  2. Mingda Li
  3. Soham Deo
Citations (2)
