
Towards the Anonymization of the Language Modeling (2501.02407v2)

Published 5 Jan 2025 in cs.CL, cs.CR, and cs.LG

Abstract: Rapid advances in NLP have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models fine-tuned and specialized on sensitive data can memorize and then expose or regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of anonymizing language models, and thus promote their sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model, and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, in a way that prevents the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches on a medical dataset and compared them against different baselines. Our results indicate that by avoiding the memorization of both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer a good tradeoff, maintaining high privacy while retaining high utility.
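
The abstract does not spell out the training mechanics, but one common way to realize "avoid memorizing identifier spans during specialization" in causal language modeling is to exclude those spans from the fine-tuning loss. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation: it assumes identifier character spans are already annotated and relies on the Hugging Face convention that tokens labeled -100 are ignored by the loss. The model name, example text, and spans are placeholders.

```python
# Hypothetical sketch: exclude annotated identifier spans from the CLM
# fine-tuning loss so the model gets no training signal for predicting them.
# This is one plausible realization of the idea, not the paper's code.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def build_inputs(text, identifier_spans):
    """Tokenize `text` and set labels to -100 (ignored by the loss)
    for every token overlapping an annotated identifier span."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    labels = enc["input_ids"].clone()
    for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"][0].tolist()):
        if any(tok_start < s_end and tok_end > s_start
               for s_start, s_end in identifier_spans):
            labels[0, i] = -100  # no gradient signal for identifier tokens
    enc.pop("offset_mapping")
    enc["labels"] = labels
    return enc

# Toy example: "John Doe" and "1984" are treated as identifying spans.
batch = build_inputs(
    "Patient John Doe, born 1984, was admitted with pneumonia.",
    identifier_spans=[(8, 16), (23, 27)],
)
loss = model(**batch).loss  # identifier tokens contribute nothing to this loss
loss.backward()
```

The same masking idea carries over to the MLM setting by never selecting identifier tokens as prediction targets; in both cases the model is still conditioned on the surrounding clinical context, which is what preserves downstream utility.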
