GENIE: Generative Note Information Extraction model for structuring EHR data (2501.18435v1)

Published 30 Jan 2025 in cs.CL

Abstract: Electronic Health Records (EHRs) hold immense potential for advancing healthcare, offering rich, longitudinal data that combines structured information with valuable insights from unstructured clinical notes. However, the unstructured nature of clinical text poses significant challenges for secondary applications. Traditional methods for structuring EHR free-text data, such as rule-based systems and multi-stage pipelines, are often limited by their time-consuming configurations and inability to adapt across clinical notes from diverse healthcare settings. Few systems provide a comprehensive attribute extraction for terminologies. While giant LLMs like GPT-4 and LLaMA 405B excel at structuring tasks, they are slow, costly, and impractical for large-scale use. To overcome these limitations, we introduce GENIE, a Generative Note Information Extraction system that leverages LLMs to streamline the structuring of unstructured clinical text into usable data with standardized format. GENIE processes entire paragraphs in a single pass, extracting entities, assertion statuses, locations, modifiers, values, and purposes with high accuracy. Its unified, end-to-end approach simplifies workflows, reduces errors, and eliminates the need for extensive manual intervention. Using a robust data preparation pipeline and fine-tuned small scale LLMs, GENIE achieves competitive performance across multiple information extraction tasks, outperforming traditional tools like cTAKES and MetaMap and can handle extra attributes to be extracted. GENIE strongly enhances real-world applicability and scalability in healthcare systems. By open-sourcing the model and test data, we aim to encourage collaboration and drive further advancements in EHR structurization.

Summary

The paper introduces GENIE, an open-source, generative model fine-tuned from Llama-3.1-8B-Instruct, designed for end-to-end structuring of unstructured clinical text in EHRs into a standardized JSON format.
GENIE employs a single-pass, generative approach to extract comprehensive attributes like semantic type, assertion status, location, and values from paragraphs of clinical notes, replacing complex multi-stage NLP pipelines.
Experimental results show GENIE outperforms traditional methods like cTAKES and MetaMap in attribute extraction and is efficient enough to run on consumer hardware, offering a scalable solution for EHR data structuring.

Core Problem and Motivation:

The paper addresses the challenge of transforming unstructured clinical text in Electronic Health Records (EHRs) into a structured format suitable for secondary uses like research, clinical decision support, and policy design. While EHRs contain rich information, the unstructured nature of clinical notes makes it difficult to efficiently extract and analyze this data. Traditional methods for structuring EHRs (rule-based systems and multi-stage NLP pipelines) are often time-consuming to configure, lack adaptability across different clinical settings, and are difficult to maintain because upgrading any single module could destabilize the entire system. LLMs are promising but computationally expensive, and running proprietary models on sensitive EHR data raises privacy concerns.

Proposed Solution: GENIE

The authors introduce GENIE (Generative Note Information Extraction), a novel, open-source system for end-to-end EHR structurization. GENIE is a generative model, fine-tuned from Llama-3.1-8B-Instruct, that processes entire paragraphs of clinical text in a single pass, extracting key information and structuring it into a standardized JSON format.

Key Features and Advantages of GENIE:

End-to-End Approach: Replaces complex, multi-stage pipelines with a single model. This drastically simplifies system maintenance and upgrades.
Generative Model: Instead of classifying or labeling, GENIE generates the structured information, including standardized medical terms and associated attributes.
Comprehensive Attribute Extraction: Extracts not only entities but also a rich set of attributes:
- Entities (medical terms) with standardized names (handling abbreviations).
- Semantic type (e.g., Disease, Symptom, Procedure).
- Assertion status (present, absent, possible, etc.).
- Body location.
- Modifiers.
- Values and units.
- Purpose.
Efficiency and Scalability: Designed to run locally on consumer-level hardware (e.g., NVIDIA RTX 3090) and processes EHR notes in a single pass, making it faster and more cost-effective than using large, proprietary LLMs directly.
Open Source: Encourages collaboration and further development in EHR structurization.
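
To make the attribute list above concrete, here is a minimal sketch of how one extracted entity could be represented and serialized to JSON. The class name, field names, and example values are all hypothetical; this summary does not give GENIE's actual JSON keys, so treat this as an illustration of the idea rather than the model's real output schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ExtractedEntity:
    # Field names are illustrative assumptions, not GENIE's real JSON keys.
    phrase: str                      # text span as written in the note
    term: str                        # standardized name (abbreviation expanded)
    semantic_type: str               # e.g. "Disease", "Symptom", "Procedure"
    assertion: str                   # e.g. "present", "absent", "possible"
    location: Optional[str] = None   # body location, if applicable
    modifier: Optional[str] = None
    value: Optional[str] = None
    unit: Optional[str] = None
    purpose: Optional[str] = None


def to_json(entities):
    """Serialize entities to a JSON array, omitting attributes that are not set."""
    records = [
        {key: val for key, val in asdict(entity).items() if val is not None}
        for entity in entities
    ]
    return json.dumps(records, indent=2)


# Made-up example values for illustration only.
example = [
    ExtractedEntity("SOB", "shortness of breath", "Symptom", "absent"),
    ExtractedEntity("Hgb", "hemoglobin", "Laboratory test", "present",
                    value="9.2", unit="g/dL"),
]
print(to_json(example))
```

Leaving unset attributes out of the serialized record keeps the output compact when an attribute does not apply to a given semantic type.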

Methodology: Training GENIE

The training pipeline for GENIE involves several key steps:

EHR Corpus: Uses the discharge reports from the MIMIC-III database.
Data Preprocessing and Term Recognition:
- Employs forward maximum matching (trie search) with the BIOS v3 ontology for initial term identification.
- Uses ChatGPT to perform line break restoration and expand abbreviations within the clinical notes to improve term recognition. The raw (unexpanded) EHR notes are used as the input to GENIE during training so that the model learns to perform word sense disambiguation (WSD) automatically.
- Filters identified terms based on semantic types to focus on essential medical concepts.
Assertion Status Annotation:
- Trains a separate assertion status annotation model based on Llama-2-7b, using the 2010 i2b2/VA NLP challenge data.
- Applies this model to the identified terms in the MIMIC data.
- Implements rule-based corrections to improve the accuracy of certain assertion status labels (e.g., "Conditional" for allergies).
Location, Modifier, Value/Unit, and Purpose Annotation:
- These attributes are annotated using GPT models (ranging from ChatGPT to GPT-4o) through iterative prompt engineering. The prompts are designed to minimize missed terms and mismatched results.
- A table specifies which attributes are applicable to each semantic type to ensure consistency and accuracy.
Data Integration and Training:
- Integrates the identified terms with their corresponding attributes into a JSON format. Matching is guided by term order and position within the note.
- Filters data to retain samples with fewer than 8,000 tokens to balance input length and output quality. Recommends splitting notes into sections shorter than 800 tokens so that the output is complete.
- The assertion status model is trained on 1,200 samples, and the GENIE model is trained on 180,000 samples.
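
The term recognition step in the pipeline above relies on forward maximum matching over a trie. A minimal sketch of that technique follows, assuming whitespace tokenization and a tiny made-up term list; the function names are hypothetical, and the real pipeline matches against the BIOS v3 ontology, whose loading and format are not covered here.

```python
_END = object()  # sentinel key marking the end of a complete term in the trie


def build_trie(terms):
    """Build a token-level trie from a list of dictionary terms."""
    root = {}
    for term in terms:
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[_END] = term
    return root


def forward_max_match(tokens, trie):
    """Scan left to right, keeping the longest dictionary term at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node = trie
        best = None
        j = i
        # Walk the trie as far as the tokens allow, remembering the longest hit.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if _END in node:
                best = (i, j, node[_END])
        if best:
            matches.append(best)
            i = best[1]  # resume after the matched term
        else:
            i += 1
    return matches


# Toy dictionary standing in for an ontology (an assumption for illustration).
trie = build_trie(["chest pain", "pain", "atrial fibrillation"])
tokens = "patient denies chest pain and atrial fibrillation".split()
print(forward_max_match(tokens, trie))
# [(2, 4, 'chest pain'), (5, 7, 'atrial fibrillation')]
```

Each match is a `(start, end, term)` tuple with an exclusive end index. Because the longest match wins, "chest pain" is returned instead of the shorter entry "pain".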

Experimental Results:

Test Dataset: A test set of 24 paragraphs (448 phrases) from MIMIC-III, manually annotated by experts.
Evaluation Metrics: F1-score for phrase extraction, and accuracy for attribute extraction (assertion status, location, modifier, value, unit). GPT-4o is used to assess equivalence for attributes.
Baseline Models: Compared GENIE to cTAKES and MetaMap, highlighting their limitations in comprehensive attribute extraction.
Results: GENIE outperforms the baseline models across all attribute extraction tasks, with competitive performance in phrase extraction and superior performance on the additional attributes that the baseline tools cannot extract. Assertion status accuracy was slightly lower than that of the single-task assertion model, which the authors attribute to a possible bias toward the 'present' label.
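
The two metrics named above can be sketched as follows. This is an illustration under stated assumptions: phrases are compared as exact strings, accuracy is taken over the gold phrases, and the `equivalent` callback defaults to exact equality as a stand-in for the GPT-4o equivalence judgment described in the paper. The function names are hypothetical, and the paper's precise matching rules may differ.

```python
def phrase_f1(predicted, gold):
    """F1 over sets of extracted phrases (exact string match assumed)."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def attribute_accuracy(pred_attrs, gold_attrs, equivalent=lambda a, b: a == b):
    """Share of gold phrases whose predicted attribute value is judged equivalent.

    `equivalent` is a placeholder for an equivalence judge; exact equality
    is used by default as a simplifying assumption.
    """
    if not gold_attrs:
        return 0.0
    correct = sum(
        1
        for phrase, gold_value in gold_attrs.items()
        if phrase in pred_attrs and equivalent(pred_attrs[phrase], gold_value)
    )
    return correct / len(gold_attrs)


# Made-up toy data for illustration only.
f1 = phrase_f1(["fever", "cough", "rash"], ["fever", "cough", "nausea"])
acc = attribute_accuracy(
    {"fever": "present", "cough": "absent"},
    {"fever": "present", "cough": "present"},
)
print(round(f1, 3), acc)
```

Passing a looser `equivalent` function (for example, a case-insensitive comparison) is one way to approximate a judge that accepts paraphrased attribute values.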

Case Study:

Manual inspection of test samples reveals that GENIE:

Robustly recognizes and restores abbreviated patterns in EHRs, including acronyms and irregular formats.
Accurately extracts locations, values, and assertion statuses, even when these attributes appear in distant sentences.
Infers the purpose of entities well.

Limitations and Future Work:

The authors acknowledge several limitations:

Inconsistent Formatting of Values and Units: Extracted values and units are not always unified into a standard format.
Potential Errors in GPT-Generated Attributes: The reliance on GPT for location and purpose annotation can lead to occasional errors.
Ambiguity in Acronyms: Ambiguous acronyms can sometimes lead to phrase extraction errors.

Future work will focus on:

Generating passage-specific acronym tables to improve term extraction.
Using cross-validation with other LLMs to verify term attributes and standardize outputs.
Developing models for other languages, such as Chinese.

Conclusion:

GENIE represents a significant advancement in EHR structurization by providing an efficient, accurate, and open-source solution for extracting a wide range of information from clinical text. The end-to-end approach simplifies the structuring process, making it more accessible and scalable for healthcare applications. The authors aim to address the identified limitations and further improve the model's robustness and applicability in future work.
