Medical Spoken Named Entity Recognition: Overview and Analysis
The paper "Medical Spoken Named Entity Recognition" introduces VietMed-NER, a pioneer dataset focusing on Named Entity Recognition (NER) within the medical domain of spoken language. This dataset addresses the complexities and challenges associated with NER tasks specifically tailored for medical conversations in Vietnamese, a largely underrepresented language in spoken NER research. The dataset is presented as the largest of its kind, featuring 18 distinct entity types pertinent to the medical domain, showcasing a substantial contribution to the field.
Dataset and Methodology
The VietMed-NER dataset is constructed from real-world audio in the VietMed ASR corpus. It defines 18 medically relevant entity types, annotated across 9,000 sentences. The data is split into training, development, and test sets following the duration-based splits of the underlying ASR corpus, with the aim of leveraging the capabilities of large pre-trained models. Annotation follows a methodology the authors call "Recursive Greedy Mapping," devised to improve annotation efficiency and consistency and to counteract data-quality problems, such as missing entity tags and segmentation errors, traditionally seen in other datasets.
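The paper summary above does not spell out the algorithm, but one plausible reading of "Recursive Greedy Mapping" is that annotated entity phrases are matched against the token stream longest-first, with the procedure then recursing into the unmatched remainders on either side of each match. The sketch below illustrates that reading in Python; the function name, the BIO tagging scheme, and the longest-match-first heuristic are assumptions for illustration, not the authors' exact method.

```python
def recursive_greedy_map(tokens, entities, tags=None, start=0, end=None):
    """Assign BIO tags to `tokens` by greedily matching the longest entity
    phrase first, then recursing into the unmatched segments on either side.

    tokens:   list of word tokens, e.g. ["benh", "nhan", "bi", "sot", "cao"]
    entities: dict mapping an entity phrase (tuple of tokens) to its type,
              e.g. {("sot", "cao"): "SYMPTOM"}  # assumed toy label set
    """
    if tags is None:
        tags = ["O"] * len(tokens)
    if end is None:
        end = len(tokens)
    if start >= end:
        return tags

    # Greedy step: find the longest entity phrase occurring in [start, end).
    best = None  # (length, position, phrase)
    for phrase in entities:
        n = len(phrase)
        for i in range(start, end - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                if best is None or n > best[0]:
                    best = (n, i, phrase)
                break  # leftmost occurrence is enough for this phrase

    if best is None:
        return tags  # no entity in this segment; tags stay "O"

    n, i, phrase = best
    etype = entities[phrase]
    tags[i] = f"B-{etype}"
    for j in range(i + 1, i + n):
        tags[j] = f"I-{etype}"

    # Recursive step: map the segments before and after the match.
    recursive_greedy_map(tokens, entities, tags, start, i)
    recursive_greedy_map(tokens, entities, tags, i + n, end)
    return tags
```

Preferring the longest match first avoids a shorter phrase (e.g. a single-word symptom) consuming tokens that belong to a longer multi-word entity, which is one way such a scheme could reduce missing or truncated entity tags.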
Experimental Setup and Models
The research adopts a two-stage pipeline for spoken NER: audio is first transcribed by an Automatic Speech Recognition (ASR) model, and NER is then applied to the transcript. For ASR, the paper uses models pre-trained on extensive Vietnamese data, specifically the w2v2-Viet and XLSR-53-Viet models, which achieve Word Error Rates (WERs) of 29.0% and 28.8%, respectively.
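A minimal sketch of such a two-stage pipeline, using the Hugging Face `transformers` pipelines, is shown below. The model identifiers and the audio filename are placeholders, not the exact checkpoints or data released with the paper.

```python
from transformers import pipeline

# Stage 1: transcribe audio with a pre-trained Vietnamese ASR model.
asr = pipeline(
    "automatic-speech-recognition",
    model="your-org/w2v2-vietnamese",  # placeholder standing in for e.g. w2v2-Viet
)
transcript = asr("consultation.wav")["text"]  # hypothetical audio file

# Stage 2: run NER on the (possibly error-containing) ASR transcript.
ner = pipeline(
    "token-classification",
    model="your-org/xlm-r-vietmed-ner",  # placeholder fine-tuned tagger
    aggregation_strategy="simple",       # merge word pieces into entity spans
)
for ent in ner(transcript):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```

Because the second stage only sees the ASR output, transcription errors propagate directly into the NER input, which is why the reported WERs matter for downstream entity quality.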
For the NER stage, the paper compares multiple state-of-the-art monolingual and multilingual pre-trained models, including PhoBERT, ViDeBERTa, and XLM-R, which vary significantly in the volume of their pre-training data. The XLM-R models, trained on 2.5TB of multilingual text, consistently outperform their monolingual counterparts, demonstrating the advantage of extensive pre-training and multilingual capability for NER.
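Fine-tuning any of these encoders for NER requires aligning word-level labels with the model's subword tokenization. The sketch below shows the standard alignment step for XLM-R; the example sentence, the tiny label set, and the choice of checkpoint are illustrative assumptions, not details from the paper.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

words = ["bệnh", "nhân", "bị", "sốt", "cao"]          # "the patient has a high fever"
word_labels = ["O", "O", "O", "B-SYMPTOM", "I-SYMPTOM"]
label2id = {"O": 0, "B-SYMPTOM": 1, "I-SYMPTOM": 2}   # assumed toy label set

enc = tokenizer(words, is_split_into_words=True)
label_ids, prev = [], None
for wid in enc.word_ids():
    if wid is None:             # special tokens such as <s> and </s>
        label_ids.append(-100)  # -100 is ignored by the cross-entropy loss
    elif wid != prev:           # first subword of a word carries the label
        label_ids.append(label2id[word_labels[wid]])
    else:                       # later subwords of the same word are masked
        label_ids.append(-100)
    prev = wid

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), label_ids)))
```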
Results and Performance
On reference text, PhoBERT_base-v2 outperformed smaller monolingual counterparts, likely benefiting from its larger training data. However, XLM-R large, a multilingual model with far more pre-training data, yielded the best results on both reference text and ASR output, reaching an F1 score of 74.0% on reference text and underscoring the efficacy of large multilingual models for spoken NER.
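For context on how such F1 scores are typically computed, the sketch below uses `seqeval`, a standard entity-level scorer for BIO-tagged NER; the tag sequences are fabricated for illustration and are not drawn from the paper's results.

```python
from seqeval.metrics import classification_report, f1_score

y_true = [["O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG"]]
y_pred = [["O", "B-SYMPTOM", "O",         "O", "B-DRUG"]]

# An entity counts as correct only if both its span and type match exactly,
# so the truncated SYMPTOM prediction above scores as a miss.
print(f1_score(y_true, y_pred))          # 0.5: one of two entities matched
print(classification_report(y_true, y_pred))
```

This strict span-and-type matching is harsher than token-level accuracy, which is worth keeping in mind when comparing the reference-text and ASR-output scores, since ASR errors can break spans even when most tokens are tagged correctly.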
Implications and Future Directions
The introduction of VietMed-NER promises advances in a range of medical language processing applications, such as correcting errors in medical ASR output or protecting privacy in speech data mining. Future research could explore more robust models tailored to low-resource languages like Vietnamese and further refine the annotation methodology for consistency across multilingual datasets.
The findings suggest that multilingual models with sufficient pre-training offer substantial improvements on NER tasks, pointing future research toward data diversity and the cross-lingual transferability of model capabilities. Annotation methodologies like Recursive Greedy Mapping could likewise improve the reliability and cost of dataset creation, particularly in resource-constrained settings.