
De-identification is not always enough (2402.00179v1)

Published 31 Jan 2024 in cs.CL

Abstract: For sharing privacy-sensitive data, de-identification is commonly regarded as adequate for safeguarding privacy. Synthetic data is also being considered as a privacy-preserving alternative. Recent successes with generative models for numerical and tabular data, together with breakthroughs in large language models (LLMs), raise the question of whether synthetically generated clinical notes could be a viable alternative to real notes for research purposes. In this work, we (i) demonstrated that de-identification of real clinical notes does not protect records against a membership inference attack, (ii) proposed a novel approach to generating synthetic clinical notes using current state-of-the-art LLMs, (iii) evaluated the performance of the synthetically generated notes in a clinical domain task, and (iv) proposed a way to mount a membership inference attack when the target model is trained on synthetic data. We observed that when synthetically generated notes closely match the performance of real data, they also exhibit privacy concerns similar to those of the real data. Whether other approaches to synthetically generated clinical notes could offer better trade-offs and become a better alternative to sensitive real notes warrants further investigation.
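
For readers unfamiliar with the term, a membership inference attack tries to decide whether a given record was part of a model's training set. The sketch below illustrates one common generic variant, a loss-threshold attack, and is not taken from the paper: the function names and the loss values are hypothetical, chosen only to show the idea that training records tend to receive lower loss than unseen records.

```python
from typing import List, Sequence


def threshold_attack(losses: Sequence[float], threshold: float) -> List[bool]:
    """Guess 'member' for every record whose loss is below the threshold."""
    return [loss < threshold for loss in losses]


def attack_auc(member_losses: Sequence[float],
               nonmember_losses: Sequence[float]) -> float:
    """Attack AUC: the probability that a randomly chosen training record
    has lower loss than a randomly chosen unseen record (ties count 0.5).
    A value of 0.5 means the attacker does no better than chance."""
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))


# Hypothetical per-record losses from a target model (illustrative only).
member_losses = [0.05, 0.10, 0.20, 0.40]      # records seen during training
nonmember_losses = [0.30, 0.60, 0.90, 1.20]   # records never seen

auc = attack_auc(member_losses, nonmember_losses)
member_guesses = threshold_attack(member_losses, threshold=0.35)
nonmember_guesses = threshold_attack(nonmember_losses, threshold=0.35)
print(f"attack AUC = {auc}")
```

An AUC well above 0.5 on such scores would suggest that the target model leaks information about which records it was trained on, which is the kind of risk the abstract refers to.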

Authors (4)
  1. Atiquer Rahman Sarkar
  2. Yao-Shun Chuang
  3. Noman Mohammed
  4. Xiaoqian Jiang