
Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research (2001.09765v1)

Published 13 Jan 2020 in cs.CY and cs.LG

Abstract: Objective: Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients. Materials and Methods: We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared the cohort generated by MACS to the cohort that would have been generated without MACS as reference standard, first by comparing distributions of an extensive set of clinical and demographic variables and then by comparing the results of two analyses addressing existing example research questions. Results: Our algorithm had an area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction efficiency gain of 77.9%. During Bias Analysis, we found no large differences in baseline characteristics and no differences in the example analyses. Conclusion: MACS with bias analysis can significantly improve the efficiency of cohort selection on EHR data while instilling confidence that outcomes research performed on the resulting cohort will not be biased.

Authors (10)
  1. Benjamin Birnbaum (5 papers)
  2. Nathan Nussbaum (2 papers)
  3. Katharina Seidl-Rathkopf (1 paper)
  4. Monica Agrawal (24 papers)
  5. Melissa Estevez (2 papers)
  6. Evan Estola (1 paper)
  7. Joshua Haimson (1 paper)
  8. Lucy He (1 paper)
  9. Peter Larson (1 paper)
  10. Paul Richardson (1 paper)
Citations (172)

Summary

Overview of Model-Assisted Cohort Selection with Bias Analysis for EHR Data Usage in Oncology Research

This paper examines Model-Assisted Cohort Selection (MACS) with Bias Analysis, a technique for making cohort selection from Electronic Health Records (EHRs) more efficient while measuring whether the resulting cohort is biased. It applies the approach to metastatic breast cancer (mBC) patients, a setting where eligibility often depends on unstructured parts of the record and where even a high-performing ML model may select patients in a non-random manner.

Key Contributions and Methodology

The MACS framework addresses the manual, resource-intensive nature of generating research cohorts from EHR data. It works in two steps: first, an ML model flags patients who are likely to be cohort-eligible; second, human abstractors review the flagged cases to confirm inclusion. Bias Analysis then compares the MACS-selected cohort with a reference standard, the cohort that would have been generated without MACS, to check whether the model introduces systematic bias.
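The two-step workflow can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Patient` record, the function names, and the toy scores are all hypothetical, and the two metrics use assumed definitions (sensitivity as the share of truly eligible patients that the model flags, efficiency gain as the share of charts abstractors no longer need to review).

```python
import math
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    score: float    # model-estimated probability of eligibility (hypothetical)
    eligible: bool  # ground truth from manual abstraction (labeled set only)


def pick_threshold(labeled, target_sensitivity):
    """Return the highest score cutoff that still meets the target sensitivity."""
    positives = sorted((p.score for p in labeled if p.eligible), reverse=True)
    if not positives:
        raise ValueError("labeled set contains no eligible patients")
    # Number of eligible patients that must be flagged to reach the target.
    needed = max(1, math.ceil(target_sensitivity * len(positives)))
    return positives[needed - 1]


def evaluate(labeled, threshold):
    """Sensitivity and efficiency gain under the assumed definitions above."""
    flagged = [p for p in labeled if p.score >= threshold]
    n_eligible = sum(p.eligible for p in labeled)
    sensitivity = sum(p.eligible for p in flagged) / n_eligible
    # Assumed definition: fraction of charts that skip manual review.
    efficiency_gain = 1 - len(flagged) / len(labeled)
    return sensitivity, efficiency_gain


# Invented toy data: 4 eligible and 6 ineligible patients.
toy = [
    Patient("p01", 0.95, True), Patient("p02", 0.90, True),
    Patient("p03", 0.70, True), Patient("p04", 0.20, True),
    Patient("p05", 0.60, False), Patient("p06", 0.30, False),
    Patient("p07", 0.10, False), Patient("p08", 0.05, False),
    Patient("p09", 0.02, False), Patient("p10", 0.01, False),
]

threshold = pick_threshold(toy, target_sensitivity=0.75)
sensitivity, efficiency_gain = evaluate(toy, threshold)
# Step 2: only patients at or above the threshold go to human abstractors.
review_queue = [p.patient_id for p in toy if p.score >= threshold]
```

The sketch makes the trade-off explicit: lowering the threshold raises sensitivity but sends more charts to review, which shrinks the efficiency gain.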

The authors implemented this methodology through:

  • Data Sources and Processing: The study used data from Flatiron Health, a dataset containing over two million active cancer patients, and drew on both structured and unstructured elements of the record.
  • Model Training: A logistic regression model with term-frequency inverse-document-frequency (TF-IDF) features was trained on 17,263 patients and evaluated on a separate test set of 17,292 patients, where it achieved an AUC of 0.976.
  • Bias Analysis: The authors compared distributions of demographic and clinical characteristics between the MACS-derived cohort and the reference standard, and also compared the results of two example outcome analyses, such as overall survival (OS) by hormone receptor and HER2 status.
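To make the model-training step concrete, here is a small sketch of TF-IDF features feeding a logistic regression, written with only the Python standard library. It is an illustration under assumptions, not the paper's pipeline: the tokenization (lowercased whitespace split), the smoothed IDF formula, the unregularized gradient-descent training loop, and the toy notes are all invented for the example. In practice a library such as scikit-learn would typically be used instead.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency for each term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    """L2-normalized TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def predict(vec, weights, bias):
    """Logistic regression probability for one sparse feature vector."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(vectors, labels, epochs=200, lr=0.5):
    """Stochastic gradient descent on the log loss (no regularization)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict(vec, weights, bias) - y
            bias -= lr * err
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
    return weights, bias


# Invented toy notes, not real patient data. Label 1 = metastatic.
notes = [
    "biopsy confirms metastatic disease in liver",
    "new bone metastases noted on scan",
    "early stage tumor no evidence of spread",
    "routine follow up no evidence of recurrence",
]
labels = [1, 1, 0, 0]

idf = fit_idf(notes)
weights, bias = train([tfidf(n, idf) for n in notes], labels)

p_pos = predict(tfidf("scan shows metastatic disease in bone", idf), weights, bias)
p_neg = predict(tfidf("no evidence of spread", idf), weights, bias)
print(f"metastatic-like note: {p_pos:.2f}, non-metastatic-like note: {p_neg:.2f}")
```

Each note becomes a sparse vector in which rarer terms weigh more, and the classifier learns one weight per term, so the resulting score can be traced back to specific words in the record.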

Results and Implications

Applying MACS yielded an abstraction efficiency gain of 77.9% at a sensitivity of 96.0%, substantially reducing manual chart review while retaining nearly all eligible patients. The Bias Analysis found no large differences in baseline demographic or clinical characteristics between the MACS cohort and the reference standard, and no differences in the results of the example analyses.
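One common way to compare baseline characteristics between two cohorts is the standardized mean difference (SMD). The summary does not say which statistic the authors used, so the sketch below is an assumption chosen for illustration; the function names and the toy values are hypothetical. An absolute SMD below roughly 0.1 is a frequently cited rule of thumb for a negligible difference.

```python
import math
from statistics import mean, variance


def smd_continuous(a, b):
    """Standardized mean difference for a continuous variable such as age."""
    diff = mean(a) - mean(b)
    pooled_sd = math.sqrt((variance(a) + variance(b)) / 2)
    if pooled_sd == 0:
        # Both groups are constant: identical means give 0, otherwise unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled_sd


def smd_binary(p_a, p_b):
    """Standardized difference for a binary variable given two proportions."""
    diff = p_a - p_b
    pooled_sd = math.sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / 2)
    if pooled_sd == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled_sd


# Invented toy values, not taken from the paper.
macs_ages = [52, 60, 70, 78]
reference_ages = [50, 60, 70, 80]
age_smd = smd_continuous(macs_ages, reference_ages)

# Hypothetical share of patients with a given binary characteristic.
flag_smd = smd_binary(0.71, 0.70)

for name, value in [("age", age_smd), ("binary flag", flag_smd)]:
    status = "negligible" if abs(value) < 0.1 else "review"
    print(f"{name}: SMD={value:.3f} ({status})")
```

Because the SMD is expressed in pooled standard deviations, it does not shrink or grow with cohort size the way a p-value does, which makes it convenient for comparing many variables side by side.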

Methodologically, the work shows that model-assisted selection can make cohort building more scalable without giving up a check on cohort quality. By reducing the number of charts that abstractors must review, MACS eases the human resource constraints that limit large-scale research on EHR data.

Future Directions

The approach could be extended to domains beyond oncology and adapted to longitudinal cohort selection as EHR data are updated over time. It is most likely to be useful in disease areas where unstructured data contribute significantly to patient characterization.

Conclusion

The paper presents a structured methodology for one of the prevailing challenges in using EHR data for research: selecting patient cohorts efficiently and accurately. By pairing an ML model with an explicit bias analysis, MACS improves efficiency while measuring, rather than assuming away, the bias a model might introduce, which supports confident use of unstructured EHR data in oncology outcomes research.