
Understanding Privacy Risks of Embeddings Induced by Large Language Models (2404.16587v1)

Published 25 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs show early signs of artificial general intelligence but struggle with hallucinations. One promising solution to mitigate these hallucinations is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation. However, such a solution risks compromising privacy, as recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained LLMs. The significant advantage of LLMs over traditional pre-trained models may exacerbate these concerns. To this end, we investigate the effectiveness of reconstructing original knowledge and predicting entity attributes from these embeddings when LLMs are employed. Empirical findings indicate that LLMs significantly improve accuracy on both evaluated tasks relative to pre-trained models, regardless of whether the texts are in-distribution or out-of-distribution. This underscores a heightened potential for LLMs to jeopardize user privacy, highlighting the negative consequences of their widespread use. We further discuss preliminary strategies to mitigate this risk.
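To make the threat model concrete, below is a minimal sketch of the kind of embedding-inversion attack the abstract describes: an attacker with black-box access to a victim embedding model trains a generative decoder to reconstruct text from its embeddings. The specific models (all-MiniLM-L6-v2 as the victim encoder, GPT-2 as the decoder), the linear soft-prompt adapter, and all hyperparameters are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of an embedding-inversion attack (not the paper's method).
# Assumes the attacker can query the victim encoder and has a training corpus.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # frozen "victim" encoder (384-d)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
decoder = GPT2LMHeadModel.from_pretrained("gpt2")   # attacker's generative decoder

# Linear adapter mapping the 384-d sentence embedding into GPT-2's 768-d hidden
# space, so the embedding can be prepended as a one-token soft prompt.
adapter = nn.Linear(384, decoder.config.n_embd)

def inversion_loss(texts):
    """Language-modeling loss for reconstructing `texts` from their embeddings."""
    with torch.no_grad():  # the victim encoder is a black box; never updated
        emb = embedder.encode(texts, convert_to_tensor=True)       # (B, 384)
    prefix = adapter(emb).unsqueeze(1)                             # (B, 1, 768)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    tok_emb = decoder.transformer.wte(batch["input_ids"])          # (B, T, 768)
    inputs = torch.cat([prefix, tok_emb], dim=1)                   # (B, 1+T, 768)
    # Ignore the prefix position and padding when computing the loss.
    labels = torch.cat(
        [torch.full((len(texts), 1), -100, dtype=torch.long),
         batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)],
        dim=1,
    )
    attn = torch.cat(
        [torch.ones(len(texts), 1, dtype=torch.long), batch["attention_mask"]],
        dim=1,
    )
    return decoder(inputs_embeds=inputs, attention_mask=attn, labels=labels).loss

# Toy training loop; a real attack would iterate over a large text corpus.
opt = torch.optim.AdamW(
    list(decoder.parameters()) + list(adapter.parameters()), lr=1e-4
)
for texts in [["a private clinical note about patient X"]] * 3:
    opt.zero_grad()
    loss = inversion_loss(texts)
    loss.backward()
    opt.step()
```

At inference time, the attacker would feed a stolen embedding through the adapter and decode greedily or with beam search; the paper's finding is that stronger LLM decoders make this reconstruction markedly more accurate.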

Authors (7)
  1. Zhihao Zhu (11 papers)
  2. Ninglu Shao (9 papers)
  3. Defu Lian (142 papers)
  4. Chenwang Wu (13 papers)
  5. Zheng Liu (312 papers)
  6. Yi Yang (855 papers)
  7. Enhong Chen (242 papers)