
Learnable Privacy Neurons Localization in Language Models (2405.10989v1)

Published 16 May 2024 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract: Concerns that LLMs memorize and disclose private information, particularly Personally Identifiable Information (PII), have become prominent within the community. Many efforts have been made to mitigate these privacy risks. However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method employs learnable binary weight masks to localize, through adversarial training, the specific neurons that account for the memorization of PII in LLMs. Our investigation finds that PII is memorized by a small subset of neurons across all layers, which exhibit the property of PII specificity. Furthermore, we validate the potential for PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.


Summary

  • The paper introduces a novel method that identifies specific privacy neurons in LLMs using binary mask learning and adversarial training.
  • The study finds that a small subset of neurons exhibits PII specificity, playing a critical role in memorizing sensitive personal data.
  • Experiments demonstrate that deactivating these neurons can significantly reduce PII leakage with minimal impact on overall model performance.

The paper "Learnable Privacy Neurons Localization in Language Models" explores the critical issue of privacy in LLMs, focusing specifically on the memorization and potential disclosure of Personally Identifiable Information (PII). The community has expressed significant concerns about these privacy risks, leading to numerous mitigation efforts. Despite these efforts, the underlying mechanisms by which LLMs memorize PII remain insufficiently understood.

To address this gap, the authors introduce a novel method that aims to identify the neurons within LLMs responsible for memorizing PII, termed "privacy neurons." This method utilizes learnable binary weight masks coupled with adversarial training to pinpoint these specific neurons. By employing these techniques, the researchers aim to isolate and analyze the neural components that contribute to PII memorization.
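To make the masking idea concrete, the sketch below shows one way such a learnable binary mask could be parameterized and optimized in PyTorch, using a hard-concrete relaxation of the kind commonly used for L0-regularized masks. This is a minimal sketch under stated assumptions, not the paper's implementation: the class and function names, the loss weighting, and the `masked_lm_loss` helper (which would run the frozen model with the gates multiplied into the chosen hidden activations) are all illustrative.

```python
import torch
import torch.nn as nn


class HardConcreteMask(nn.Module):
    """Learnable relaxed-binary gate per neuron (hard-concrete style).

    Gates near 1 leave a neuron untouched; gates pushed toward 0 flag it
    as a candidate privacy neuron.
    """

    def __init__(self, num_neurons, beta=2 / 3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_neurons))  # gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:  # stochastic relaxation for gradient-based search
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:              # deterministic gates at evaluation time
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta) and clip back into [0, 1] so most gates
        # land exactly at 0 or 1.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)


def localization_loss(model, mask, pii_batch, general_batch,
                      lam_utility=1.0, lam_sparsity=1e-3):
    """Adversarial-style objective sketch: gating off the selected neurons
    should destroy PII recall (drive pii_loss up) while preserving general
    language modeling (keep general_loss low) and switching off few neurons.
    `masked_lm_loss` is a hypothetical helper, not a library function.
    """
    z = mask()  # one gate per neuron, values in [0, 1]
    pii_loss = masked_lm_loss(model, pii_batch, z)
    general_loss = masked_lm_loss(model, general_batch, z)
    closed_gates = (1.0 - z).sum()  # how many neurons are being switched off
    return -pii_loss + lam_utility * general_loss + lam_sparsity * closed_gates
```

Only the mask logits are trained here; the base model stays frozen, so the procedure localizes neurons rather than altering the model's weights.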

The main findings of the paper highlight that PII is typically memorized by a small, distinct subset of neurons distributed throughout the layers of the model. These neurons exhibit a unique property termed "PII specificity," indicating their specialized role in handling PII-related data.
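One way to probe this specificity quantitatively is to localize neuron sets separately for different PII categories and measure how much they overlap; low cross-category overlap would indicate that distinct neuron subsets specialize in distinct kinds of PII. The snippet below is an illustrative analysis under that assumption, not necessarily the paper's exact protocol, and the example neuron sets are made up for demonstration.

```python
def jaccard(a, b):
    """Overlap between two sets of (layer, neuron) index pairs."""
    a, b = set(a), set(b)
    return len(a & b) / max(len(a | b), 1)


# Hypothetical neuron sets from separate mask-training runs, e.g. on
# email addresses vs. phone numbers.
email_neurons = {(2, 114), (2, 873), (5, 42), (9, 7)}
phone_neurons = {(2, 873), (4, 19), (9, 310)}

print(f"cross-category overlap: {jaccard(email_neurons, phone_neurons):.2f}")
```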

To further validate the significance of these privacy neurons, the authors propose mitigating PII risks by deactivating the localized neurons. Both quantitative and qualitative experiments show that this deactivation significantly reduces the risk of PII disclosure without severely impacting the model's overall performance.
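In practice, such deactivation can be approximated by zeroing the located units at inference time, for example with forward hooks. The sketch below assumes a GPT-Neo-style Hugging Face causal LM and treats the outputs of each block's first MLP projection as the "neurons"; both the module layout and the `privacy_neurons` mapping are assumptions for illustration, not the paper's code.

```python
import torch


def deactivate_privacy_neurons(model, privacy_neurons):
    """Zero out localized privacy neurons at inference via forward hooks.

    `privacy_neurons` is assumed to map layer index -> tensor (or list) of
    FFN neuron indices whose learned gates collapsed to zero.
    """
    handles = []
    for layer_idx, neuron_ids in privacy_neurons.items():
        ffn_in = model.transformer.h[layer_idx].mlp.c_fc  # assumed layout

        def hook(module, inputs, output, ids=neuron_ids):
            output = output.clone()
            output[..., ids] = 0.0  # silence the memorizing units
            return output

        handles.append(ffn_in.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the model
```

Because the intervention touches only a small set of units, the model's general capabilities are largely preserved, which is consistent with the reported experimental findings.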

In summary, this research provides valuable insights into the specific neural mechanisms of PII memorization within LLMs and offers a promising strategy for enhancing privacy protection by targeting and deactivating privacy-sensitive neurons.
