
Recovering from Privacy-Preserving Masking with Large Language Models (2309.08628v3)

Published 12 Sep 2023 in cs.CL, cs.CR, and cs.LG

Abstract: Model adaptation is crucial to handle the discrepancy between proxy training data and the actual user data received. To effectively perform adaptation, textual data of users is typically stored on servers or their local devices, where downstream NLP models can be directly trained using such in-domain data. However, this might raise privacy and security concerns due to the extra risks of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has been recently explored. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets to compare these methods. Experimental results show that models trained on the obfuscation corpora are able to achieve comparable performance to those trained on the original data without privacy-preserving token masking.

Authors (8)
  1. Arpita Vats
  2. Zhe Liu
  3. Peng Su
  4. Debjyoti Paul
  5. Yingyi Ma
  6. Yutong Pang
  7. Zeeshan Ahmed
  8. Ozlem Kalinli
Citations (9)

Summary

An Examination of Privacy-Preserving Data Masking Recovery Using LLMs

The paper "Recovering from Privacy-Preserving Masking with LLMs" addresses the critical challenge of balancing privacy preservation in user data with the efficacy of machine learning models. As users' data privacy becomes paramount, approaches that replace sensitive information in text with generic markers or masks have emerged. This paper's central contribution is leveraging LLMs to replace masked tokens in a manner that preserves model performance in downstream language tasks.

Problem Context and Methodology

Deployed machine learning models often face a discrepancy between their training data and end-user data. This is particularly prominent in NLP, where models require adaptation to domain-specific textual data that may contain sensitive information. Traditional methods of adapting to user data risk inadvertently revealing private user details.

To mitigate these risks, the paper explores privacy-preserving masking techniques that obfuscate sensitive data by replacing certain tokens with a generic marker such as “[MASK]”. Three distinct strategies for automatic token masking are presented (a sketch of each follows the list):

  1. Allow List: Only tokens present in a predefined list of common, nonsensitive words are retained unmasked.
  2. Vocabulary Threshold: Common words above a certain frequency in a broad dataset are retained, masking rarer terms presumed to be more sensitive.
  3. Entity Tagger: A Named Entity Recognition (NER) model identifies named entities such as names and locations, which are then masked.
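
To make these strategies concrete, below is a minimal Python sketch of each applied to whitespace-tokenized text. The function names, the min_count threshold, and the use of precomputed NER flags are illustrative assumptions rather than details from the paper:

```python
MASK = "[MASK]"

def mask_allow_list(tokens, allow_list):
    # Allow list: keep only tokens found in a predefined set of
    # common, non-sensitive words; mask everything else.
    return [t if t.lower() in allow_list else MASK for t in tokens]

def mask_vocab_threshold(tokens, corpus_freq, min_count=100):
    # Vocabulary threshold: keep words whose corpus frequency meets a
    # cutoff, masking rarer terms presumed more likely to be sensitive.
    return [t if corpus_freq.get(t.lower(), 0) >= min_count else MASK
            for t in tokens]

def mask_entity_tagger(tokens, entity_flags):
    # Entity tagger: mask tokens that an NER model has flagged as
    # named entities (names, locations, etc.).
    return [MASK if flagged else t
            for t, flagged in zip(tokens, entity_flags)]
```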

Once the data is masked, the central research question is how to select suitable substitutes for the masked tokens using LLMs while maintaining semantic integrity and model performance.

The proposed methodologies for recovering masked data leverage several LLM-based techniques (a sampling sketch follows the list):

  • Top-K Selection: Instead of always taking the single best prediction from the LLM, a substitute is sampled from the top-K candidates, introducing variability that can improve model robustness by simulating plausible variants of the original data.
  • Fine-Tuning: Pre-trained models such as BERT, RoBERTa, and Llama 2 are further fine-tuned on domain-specific data or on synthetic data derived from the masking techniques above, improving contextual prediction accuracy.
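
As a minimal sketch of the top-K recovery idea, the snippet below fills masks with an off-the-shelf masked language model via Hugging Face Transformers. The choice of roberta-base, k=10, and filling each mask independently from a single forward pass are simplifying assumptions; the paper's fine-tuned variants would swap in a domain-adapted checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def recover_masked(text, k=10):
    # For each mask position, sample a substitute from the model's
    # top-K predictions instead of always taking the argmax.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    ids = inputs["input_ids"][0].tolist()
    for pos, tok_id in enumerate(ids):
        if tok_id == tokenizer.mask_token_id:
            top = torch.topk(logits[pos], k)
            probs = torch.softmax(top.values, dim=-1)
            ids[pos] = top.indices[torch.multinomial(probs, 1)].item()
    return tokenizer.decode(ids, skip_special_tokens=True)

masked = f"I met {tokenizer.mask_token} at the {tokenizer.mask_token} yesterday."
print(recover_masked(masked))
```

Sampling from the top-K rather than taking the argmax yields different plausible substitutes on each pass, which is what introduces the variability described above.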

Experimental Evaluations

Empirical evaluations were conducted on the Fisher, Reddit, and WSJ datasets, with both language modeling and Automatic Speech Recognition (ASR) serving as downstream tasks. Key findings are as follows:

  • Performance: RoBERTa-based methods consistently showed the strongest token recovery across datasets. Notably, perplexity on language modeling tasks demonstrated that models trained on data with recovered tokens approached those trained on unmasked data, particularly under the vocabThres and entityTagger strategies.
  • Fine-Tuning Impact: Fine-tuning substantially enhanced token recovery, particularly when domain data was scarce; both fine-tuned BERT and RoBERTa showed improved prediction fidelity on token substitution.
  • ASR Implications: Integrating language models trained on recovered data into ASR systems via shallow fusion (sketched below) yielded substantial word error rate (WER) improvements, underlining the practical value of the token recovery methodology.
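
For intuition, shallow fusion interpolates the ASR system's hypothesis score with the score of the externally trained LM, here the one trained on the recovered corpus. The snippet below is a sketch of rescoring an n-best list under that scheme; the interpolation weight of 0.3 is an illustrative assumption, not a value from the paper:

```python
def rescore_nbest(hypotheses, lm_weight=0.3):
    # hypotheses: list of (text, asr_logprob, lm_logprob) triples.
    # Shallow fusion adds a weighted external-LM log-probability to the
    # ASR score; return the n-best list reordered by the fused score.
    return sorted(
        hypotheses,
        key=lambda h: h[1] + lm_weight * h[2],
        reverse=True,
    )
```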

Implications and Future Directions

This work paves the way for advancing privacy-preserving model training, particularly focusing on NLP and ASR systems' adaptability to domain-specific contexts. The ability to balance privacy and performance opens new avenues for machine learning applications where sensitive data handling is essential. Future research directions could explore more nuanced token recovery techniques, perhaps incorporating class-specific markers or developing objective functions more directly tied to downstream tasks' performance.

Additionally, ongoing advancements in LLM architectures may further reduce the performance gap between models trained on obfuscation corpora and those using original data, enhancing the viability of privacy-sensitive machine learning solutions in real-world applications.
