De-identification of Patient Notes with Recurrent Neural Networks (1606.03475v1)

Published 10 Jun 2016 in cs.CL, cs.AI, cs.NE, and stat.ML

Abstract:

Objective: Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information (PHI) that need to be removed to de-identify patient notes. Manual de-identification is impractical given the size of EHR databases, the limited number of researchers with access to the non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value.

Materials and Methods: We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset.

Results: Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 97.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.06.

Conclusion: Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no feature engineering.

De-identification of Patient Notes with Recurrent Neural Networks

The paper "De-identification of Patient Notes with Recurrent Neural Networks" introduces a system that uses artificial neural networks (ANNs) to de-identify patient notes in electronic health records (EHRs). Automated de-identification matters because regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States require protected health information to be removed before notes can be shared with most investigators. The proposed method depends on neither handcrafted features nor rules, both of which earlier systems relied on.

Methodological Innovations

The core innovation of this research is replacing manual feature engineering with an ANN model that learns its features from the annotated data. This contrasts with rule-based systems and with earlier machine-learning systems that depend on manually crafted features. The proposed system is built on recurrent neural networks (RNNs), specifically long short-term memory (LSTM) networks, which are suited to sequential data such as the tokens of a note.

The architecture comprises three layers:

Character-enhanced Token Embedding Layer: Utilizes both pre-trained token embeddings and character-level embeddings, allowing nuanced handling of token forms and out-of-vocabulary words.
Label Prediction Layer: Implements a bidirectional LSTM to predict token labels, combining forward and backward contextual information crucial for identifying protected health information (PHI).
Label Sequence Optimization Layer: Employs transition probability matrices to account for dependencies between sequential labels, achieving more coherent labeling sequences.
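
The effect of the label sequence optimization layer can be illustrated with a small Viterbi decoder. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `viterbi_decode`, the three-label scheme, and all scores are hypothetical, and the start and end transitions that a full model would typically include are omitted. Per-token scores stand in for the output of the label prediction layer, and `transitions[i][j]` is an assumed score for moving from label i to label j.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the label index sequence with the highest total score.

    emissions[t][j]   : score of label j at token t (hypothetical values).
    transitions[i][j] : score of label i being followed by label j.
    """
    if not emissions:
        return []
    n_labels = len(transitions)
    # best[j] = score of the best path that ends in label j at the current token
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for j in range(n_labels):
            # Previous label that maximizes the score of arriving at label j
            i_star = max(range(n_labels),
                         key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[i_star] + transitions[i_star][j] + scores[j])
            pointers.append(i_star)
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label to recover the path
    path = [max(range(n_labels), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical labels and scores for the tokens "saw", "John", "Smith"
labels = ["O", "B-NAME", "I-NAME"]
emissions = [
    [2.0, 0.0, 0.0],
    [0.4, 0.5, 0.6],   # taken alone, this token would be labeled I-NAME
    [0.0, 0.1, 1.0],
]
transitions = [
    [0.0, 0.0, -10.0],  # O -> I-NAME is strongly penalized
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
decoded = [labels[k] for k in viterbi_decode(emissions, transitions)]
print(decoded)
```

With these scores, labeling each token independently would place I-NAME directly after O, whereas decoding over the whole sequence selects O, B-NAME, I-NAME, which is the kind of coherence the transition scores are meant to enforce.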

Empirical Evaluation and Results

The system was benchmarked against state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset and the MIMIC de-identification dataset. The authors assembled the latter themselves, and it is twice as large as the i2b2 2014 dataset. On both datasets the ANN model outperformed the comparison systems, reaching an F1-score of 97.85 on i2b2 and 99.23 on MIMIC.

The ANN model handles varying linguistic contexts and typos without pre-specified rules or gazetteers, which makes it adaptable to new data and reduces manual effort. Precision is not perfect, especially for names (a sensitive PHI category), but recall on MIMIC was 99.25, a promising indicator for practical use in medical settings.
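
For readers unfamiliar with the metrics, the sketch below shows how precision, recall, and F1 can be computed at the token level. It assumes a binary view (PHI versus non-PHI) in which any label other than "O" counts as PHI; the function name `precision_recall_f1`, the label names, and the toy data are illustrative assumptions and do not reproduce the paper's evaluation script.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[str],
                        predicted: Sequence[str],
                        non_phi_label: str = "O") -> Tuple[float, float, float]:
    """Binary token-level precision, recall, and F1 for PHI detection."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        gold_is_phi = g != non_phi_label
        pred_is_phi = p != non_phi_label
        if gold_is_phi and pred_is_phi:
            tp += 1   # PHI token correctly flagged
        elif pred_is_phi:
            fp += 1   # non-PHI token flagged as PHI
        elif gold_is_phi:
            fn += 1   # PHI token missed, the costly error for privacy
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: one missed PHI token and one false alarm
gold = ["O", "NAME", "NAME", "O", "DATE", "O"]
pred = ["O", "NAME", "O", "O", "DATE", "DATE"]
print(precision_recall_f1(gold, pred))
```

In de-identification, recall is usually the more critical of the two, since a false negative leaves PHI in the released note, while a false positive only removes some non-sensitive text.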

Theoretical and Practical Implications

The implications of this work are twofold. Theoretically, it exemplifies the potential of neural networks to outperform traditional methods in natural language processing tasks by learning context from data, without manual intervention at the feature level. Practically, it provides a scalable, high-performance de-identification model pivotal for data-sharing initiatives in medical research, aligned with stringent privacy norms.

Moving forward, potential developments could include integrating patient-specific gazetteers to improve PHI detection, particularly for categories such as names, where a missed instance (a false negative) is most costly. Such an integration would also let institutions tailor the model to the nuances of their own datasets.

Overall, this research advances the computational methods used to safeguard patient privacy in electronic health records and provides a baseline for future work on automated text de-identification.

Authors (4)

Franck Dernoncourt (161 papers)
Ji Young Lee (11 papers)
Peter Szolovits (44 papers)
Ozlem Uzuner (26 papers)

Citations (361)
