Towards the Anonymization of the Language Modeling (2501.02407v2)
Abstract: Rapid advances in NLP have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models fine-tuned and specialized on sensitive data can memorize and then expose or regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of anonymizing language models, and thus promote their sharing. Specifically, we propose both a masked language modeling (MLM) methodology to specialize a BERT-like language model and a causal language modeling (CLM) methodology to specialize a GPT-like model, which prevent the model from memorizing direct and indirect identifying information present in the training data. We comprehensively evaluated our approaches using a medical dataset and compared them against several baselines. Our results indicate that, by avoiding the memorization of both direct and indirect identifiers during model specialization, our masked and causal language modeling schemes offer a good tradeoff, maintaining high privacy while retaining high utility.
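The abstract does not detail the mechanism by which identifier memorization is avoided during specialization. One common way to realize this idea in causal language modeling, shown below purely as an illustrative assumption rather than the paper's actual method, is to exclude identifier token positions from the training loss by setting their labels to -100, the ignore index of PyTorch's cross-entropy that Hugging Face transformers honors. The model name `gpt2`, the helper `build_inputs`, and the pre-computed identifier character spans are all hypothetical choices for the sketch.

```python
# Minimal sketch (not the paper's code): causal-LM fine-tuning that excludes
# identifier tokens from the loss. Assumes identifier character spans are
# already known (e.g., from a prior de-identification / NER step).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # illustrative stand-in for the GPT-like model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def build_inputs(text, identifier_spans):
    """identifier_spans: list of (start, end) character offsets of direct or
    indirect identifiers. Tokens overlapping these spans get label -100 so
    the cross-entropy loss never rewards the model for predicting them."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    labels = enc["input_ids"].clone()
    for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"][0].tolist()):
        if any(tok_start < e and tok_end > s for s, e in identifier_spans):
            labels[0, i] = -100  # exclude identifier tokens from the loss
    enc.pop("offset_mapping")  # not accepted by the model's forward pass
    enc["labels"] = labels
    return enc

# Toy example: "John Smith" (characters 8-18) is a direct identifier.
batch = build_inputs("Patient John Smith was admitted with chest pain.",
                     identifier_spans=[(8, 18)])
loss = model(**batch).loss  # gradients ignore the masked identifier tokens
loss.backward()
```

A masked language modeling variant for a BERT-like model could follow the same pattern: simply never select identifier positions as masked prediction targets, so the MLM loss is computed only over non-identifying tokens.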