Recovering from Privacy-Preserving Masking with Large Language Models (2309.08628v3)
Abstract: Model adaptation is crucial for handling the discrepancy between proxy training data and the actual user data a system receives. To perform adaptation effectively, users' textual data is typically stored on servers or on their local devices, where downstream NLP models can be trained directly on such in-domain data. However, this raises privacy and security concerns because of the added risk of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has recently been explored. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose several pre-trained and fine-tuned LLM-based approaches and conduct empirical studies on multiple datasets to compare these methods. Experimental results show that models trained on the obfuscated corpora achieve performance comparable to models trained on the original data without privacy-preserving token masking.
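The sketch below illustrates the masked-token substitution step the abstract describes, assuming a RoBERTa-style pretrained masked language model exposed through the Hugging Face `fill-mask` pipeline. The model choice (`roberta-base`) and the helper `substitute_masks` are illustrative assumptions, not the authors' exact setup.

```python
from transformers import pipeline

# Pretrained masked LM used to propose substitutes for generic mask markers.
fill_mask = pipeline("fill-mask", model="roberta-base")
MASK = fill_mask.tokenizer.mask_token  # "<mask>" for RoBERTa

def substitute_masks(text: str) -> str:
    """Fill each generic mask marker, left to right, with the model's top
    suggestion so that earlier substitutions give context to later ones."""
    while MASK in text:
        prefix, suffix = text.split(MASK, 1)
        # Predict only the first remaining mask; later masks are hidden from
        # this call so the pipeline sees a single masked position.
        result = fill_mask(prefix + MASK + suffix.replace(MASK, ""), top_k=1)
        text = prefix + result[0]["token_str"].strip() + suffix
    return text

# Example: a sentence whose identifying tokens were masked upstream.
print(substitute_masks(f"Hi, this is {MASK} calling from {MASK}."))
```

The resulting obfuscated corpus, with masks replaced by plausible substitutes, could then be used to train a downstream language model in place of the original user data.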
- Arpita Vats
- Zhe Liu
- Peng Su
- Debjyoti Paul
- Yingyi Ma
- Yutong Pang
- Zeeshan Ahmed
- Ozlem Kalinli