Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques (2505.09794v1)

Published 14 May 2025 in cs.CL and cs.AI

Abstract: Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, NLP offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV's NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical linguistic model pre-trained in Spanish. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.

Authors (5)

J. Moreno-Casanova
J. M. Auñón
A. Mártinez-Pérez
M. E. Pérez-Martínez
M. E. Gas-López

Summary

Overview of Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

The paper "Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques" presents a study aimed at improving the extraction of clinical information from electronic health records (EHRs) using NLP methods. Specifically, the authors use Named Entity Recognition (NER) to automatically identify and categorize clinical entities in lung and breast cancer reports. The work matters because manual extraction is labor-intensive and error-prone, which limits data-driven approaches in healthcare.

The study employed GMV's NLP tool, uQuery, which detects clinical entities and converts them into standardized formats such as SNOMED and OMOP. The authors used a dataset from the Health Research Institute Hospital La Fe (IIS La Fe) comprising 600 annotated reports (200 breast cancer, 400 lung cancer), with eight clinical entities manually labeled on the Doccano platform. For NER, they fine-tuned bsc-bio-ehr-en3, a RoBERTa-based biomedical language model pre-trained in Spanish, using the Transformers architecture. The fine-tuned model recognizes these clinical entities in reports for both cancer types.
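As a rough illustration of how span annotations become training data for NER, the sketch below converts character-offset spans into token-level BIO tags, the form a token-classification model is usually fine-tuned on. The `(start, end, label)` span layout is assumed to resemble Doccano's sequence-labeling export. The function name `spans_to_bio`, the whitespace tokenization, and the example sentence with its label assignments are hypothetical and not taken from the paper; a real pipeline would align tags to the model's subword tokenizer instead.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-offset spans [(start, end, label), ...] into
    whitespace tokens with BIO tags.

    Assumption: spans do not overlap and `end` is exclusive.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"  # default: token is outside every entity
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                # The token covering the span start gets B-, later ones I-.
                prefix = "B-" if tok_start <= start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Hypothetical example; the label-to-word assignments are illustrative only.
example_text = "Patient shows liver metastasis today"
example_spans = [(0, 7, "PAT"), (14, 30, "MET")]
example_tokens, example_tags = spans_to_bio(example_text, example_spans)
print(list(zip(example_tokens, example_tags)))
```

Whitespace tokenization keeps the sketch short; with a subword tokenizer the same overlap test would be applied to each subword's character offsets.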

Strong Numerical Results and Claims

The reported results are strong. The model achieved high precision, recall, and F1 scores, particularly for entities such as MET and PAT, while less frequent entities such as EVOL remain harder to detect. On validation with the combined dataset, it reached an accuracy of approximately 95.6% and an F1 score of around 0.74. The gap between the two figures is common in NER evaluation: if accuracy is computed per token, the many tokens that belong to no entity inflate it, so the F1 score is the more informative measure. In a separate manual validation, the paper reported 98.5% accuracy in entity detection across a broad range of report types.
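To make the relationship between accuracy and F1 concrete, here is a minimal sketch that scores a predicted BIO sequence against a gold one in two ways: per-token accuracy and exact-match entity-level precision, recall, and F1. It is not the paper's evaluation code. The strict exact-match criterion, the function names, and the toy tag sequences are assumptions chosen for illustration.

```python
def extract_entities(tags):
    """Return the set of (start, end_exclusive, label) spans in a BIO sequence."""
    entities = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            entities.add((start, i, label))
            start, label = None, None
        # A B- tag, or a stray I- tag with no open span, opens a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return entities


def entity_prf(gold, pred):
    """Exact-match entity-level precision, recall, and F1."""
    gold_ents, pred_ents = extract_entities(gold), extract_entities(pred)
    true_pos = len(gold_ents & pred_ents)
    precision = true_pos / len(pred_ents) if pred_ents else 0.0
    recall = true_pos / len(gold_ents) if gold_ents else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy sequences (hypothetical): 20 tokens, two gold entities, one boundary error.
gold_tags = ["O"] * 8 + ["B-MET", "I-MET"] + ["O"] * 6 + ["B-EVOL", "I-EVOL", "I-EVOL", "O"]
pred_tags = ["O"] * 8 + ["B-MET", "I-MET"] + ["O"] * 6 + ["B-EVOL", "I-EVOL", "O", "O"]

accuracy = token_accuracy(gold_tags, pred_tags)
precision, recall, f1 = entity_prf(gold_tags, pred_tags)
print(f"token accuracy={accuracy:.2f}, entity F1={f1:.2f}")
```

In this toy case a single mislabeled token leaves token accuracy at 19 of 20 but halves the entity-level F1, because the truncated EVOL span no longer matches its gold span exactly.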

Discussion of Implications

The successful application of NLP techniques in this context has several practical implications. Automating clinical data extraction can streamline clinical workflows, reduce the administrative burden on healthcare professionals, and potentially accelerate research and patient care. It can also support large-scale analysis of real-world evidence (RWE), facilitating clinical decision support, personalized medicine development, and biomarker identification.

Theoretically, this work contributes to the growing body of research on medical NLP applications, demonstrating the potential of machine learning models in processing highly specialized, unstructured text data from clinical reports. Moreover, the paper illustrates the importance of integrating pre-processing layers to optimize model performance.

Speculation on Future Developments

Looking forward, continuous refinements in model architectures, improved training techniques, and expanded datasets could further enhance entity detection capabilities. Future work could focus on optimizing detection for less frequent entities or adapting the model for multilingual or multi-center datasets to increase its applicability across diverse healthcare settings. Additionally, incorporating real-time processing could support dynamic EHR analysis, propelling advancements in predictive analytics and proactive healthcare interventions.

In summary, this research exemplifies the promising intersection of AI and healthcare, paving the way for more efficient and accurate clinical data analysis, with the potential to transform healthcare delivery and outcomes.
