De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks (2209.09631v2)
Abstract: Unstructured textual data are at the heart of health systems: liaison letters between doctors, operating reports, coding of procedures according to the ICD-10 standard, etc. The details included in these documents make it possible to know the patient better, to improve their care, to study pathologies more closely, and to reimburse the associated medical procedures accurately. All of this seems to be, at least partially, within reach of today's artificial intelligence techniques. However, for obvious reasons of privacy protection, the designers of these AI systems do not have the legal right to access these documents as long as they contain identifying data. De-identifying these documents, i.e. detecting and deleting all identifying information present in them, is a legally necessary step for sharing this data between these two complementary worlds. Over the last decade, several proposals have been made to de-identify documents, mainly in English. While the detection scores are often high, the substitution methods are often not very robust to attacks. In French, very few methods are based on arbitrary detection and/or substitution rules. In this paper, we propose a new comprehensive de-identification method dedicated to French-language medical documents. Both the detection of identifying elements (based on deep learning) and their substitution (based on differential privacy) build on the most proven existing approaches. The result is an approach that effectively protects the privacy of the patients at the heart of these medical documents. The whole approach has been evaluated on a French-language medical dataset from a French public hospital, and the results are very encouraging.
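As a rough illustration of the detect-then-substitute idea summarized in the abstract, the sketch below replaces already-detected spans in a note. It is not the authors' implementation: the detection step is assumed to have been done elsewhere (here the spans are supplied by hand), and the function names `laplace_noise` and `substitute`, the labels `NAME` and `AGE`, the `sensitivity` bound, and the `epsilon` value are all hypothetical choices. Numeric ages are perturbed with the standard Laplace mechanism (noise scale = sensitivity / epsilon); every other span is replaced by a placeholder tag.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def substitute(text, spans, epsilon, rng, sensitivity=120.0):
    """Replace detected spans in `text`.

    spans: (start, end, label) tuples, assumed non-overlapping and
    produced by some upstream detector (hypothetical here).
    sensitivity: assumed bound on the range of an age value.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        if label == "AGE":
            # Laplace mechanism: noise scale is sensitivity / epsilon.
            noisy = int(text[start:end]) + laplace_noise(sensitivity / epsilon, rng)
            # Rounding and clipping are post-processing steps.
            out.append(str(max(0, round(noisy))))
        else:
            # Non-numeric identifiers become a placeholder tag.
            out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Usage with hand-written spans standing in for a detector's output.
note = "Patient Jean Dupont, 67 ans, admis le 3 mars."
name_start = note.index("Jean Dupont")
age_start = note.index("67")
spans = [
    (name_start, name_start + len("Jean Dupont"), "NAME"),
    (age_start, age_start + len("67"), "AGE"),
]
print(substitute(note, spans, epsilon=10.0, rng=random.Random(0)))
```

A smaller `epsilon` gives stronger privacy but noisier ages; choosing it, and the sensitivity bound, depends on the data and the threat model and is left open here.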
Authors: Yakini Tchouka, Jean-François Couchot, Maxime Coulmeau, David Laiymani, Philippe Selles, Azzedine Rahmani