- The paper demonstrates a novel method using over 5 million ORCID profiles to generate large-scale labeled datasets for evaluating author name disambiguation tools like Author-ity2009.
- Evaluation shows tools like Author-ity2009 achieve high precision (around 0.99), outperforming simple heuristics, and perform well on challenging names across various ethnicities.
- ORCID linkage provides a scalable, diverse data source for robust evaluation of disambiguation tools, highlighting their strengths and weaknesses for different name types.
The paper "ORCID-linked labeled data for evaluating author name disambiguation at scale" by Jinseok Kim and Jason Owen-Smith addresses the challenge of evaluating author name disambiguation methods in large-scale bibliographic datasets, with a specific focus on leveraging ORCID as an authoritative source for creating labeled data. ORCID, an open researcher and contributor ID system, provides researcher profiles containing authorship information that can be linked to bibliographic records to generate large-scale labeled datasets. This study utilizes over 5 million ORCID profiles to evaluate the disambiguation capabilities of the Author-ity2009, a tool employed for disambiguating author names in the extensive MEDLINE database.
Methodology:
- Data Linking and Collection:
- The paper develops datasets by linking MEDLINE name instances with ORCID and NIH-funded researcher profiles. It also employs self-citation information to generate labeled datasets.
- Three labeled datasets are constructed:
- AUT-ORC: Linking Author-ity2009 with ORCID profiles, resulting in 3 million labeled instances.
- AUT-NIH: Linking Author-ity2009 with NIH PI data, offering 313K labeled instances.
- AUT-SCT: Using self-citation data to generate more than 6 million instance pairs.
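
The linkage behind these datasets amounts to record matching across sources. The following is a minimal, hypothetical sketch (not the authors' actual pipeline): an ORCID profile is paired with a bibliographic name instance when both reference the same paper identifier and the surnames agree.

```python
# Hypothetical linkage sketch (not the paper's actual pipeline): pair an
# ORCID profile with a bibliographic name instance when both reference the
# same paper identifier (here, a DOI) and the surnames agree.

def link_orcid_to_records(orcid_profiles, name_instances):
    """orcid_profiles: dict orcid_id -> {"surname": str, "dois": set of str}.
    name_instances: list of dicts with "instance_id", "surname", "doi".
    Returns labeled (orcid_id, instance_id) pairs."""
    # Index name instances by the paper they appear on.
    by_doi = {}
    for inst in name_instances:
        by_doi.setdefault(inst["doi"], []).append(inst)

    labeled = []
    for orcid_id, profile in orcid_profiles.items():
        for doi in profile["dois"]:
            for inst in by_doi.get(doi, []):
                # Require an exact (case-insensitive) surname match.
                if inst["surname"].lower() == profile["surname"].lower():
                    labeled.append((orcid_id, inst["instance_id"]))
    return labeled
```

In practice such linkage would need to handle missing identifiers, surname variants, and transliteration, which is part of what makes building these datasets nontrivial.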
- Evaluation Criteria:
- Clustering Performance: Evaluated using B-Cubed metrics (Recall, Precision, F1) to gauge how well the disambiguation tool clusters name instances correctly.
- Classification Performance: Measured by how accurately Author-ity2009 classifies self-citation pairs as matched (same author) or non-matched.
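
B-Cubed metrics score clustering per item: each name instance contributes a precision (the fraction of its predicted cluster that truly belongs with it) and a recall (the fraction of its true cluster that the prediction recovered), and the per-item scores are averaged. A minimal sketch:

```python
from collections import defaultdict

def b_cubed(pred, true):
    """pred, true: dict mapping each name instance to a cluster label.
    Returns (precision, recall, f1), averaged per instance."""
    def members(assignment):
        # Map each instance to the set of instances sharing its label.
        groups = defaultdict(set)
        for item, label in assignment.items():
            groups[label].add(item)
        return {item: groups[label] for item, label in assignment.items()}

    p_cl, t_cl = members(pred), members(true)
    items = list(pred)
    precision = sum(len(p_cl[i] & t_cl[i]) / len(p_cl[i]) for i in items) / len(items)
    recall = sum(len(p_cl[i] & t_cl[i]) / len(t_cl[i]) for i in items) / len(items)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, lumping two distinct two-paper authors into one predicted cluster yields B-Cubed recall 1.0 but precision 0.5, which is why the metric pair exposes both over-merging and over-splitting.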
Results:
- Precision and Recall:
- Author-ity2009 achieves precision of roughly 0.99 across all three datasets, substantially surpassing baseline heuristics such as all-initials-based (AINI) and first-initial-based (FINI) matching.
- Recall, however, varies across datasets and ethnic name groups, and Author-ity2009's advantage over the heuristics is largest for highly ambiguous names.
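
The AINI and FINI baselines simply merge name instances that share a surname and initials. A hypothetical sketch (illustrative key construction, not the paper's implementation) shows why FINI over-merges:

```python
def heuristic_clusters(instances, mode="AINI"):
    """instances: list of (instance_id, surname, initials) tuples.
    AINI keys on surname + all initials; FINI on surname + first initial.
    Returns dict instance_id -> cluster key (hypothetical illustration)."""
    clusters = {}
    for inst_id, surname, initials in instances:
        key = initials if mode == "AINI" else initials[:1]
        clusters[inst_id] = (surname.lower(), key.lower())
    return clusters
```

Here "Kim JS" and "Kim JH" fall into distinct AINI clusters but the same FINI cluster ("kim", "j"): FINI merges different authors, boosting recall at the cost of precision on surnames shared by many researchers.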
- Per-Ethnicity Analysis:
- Author-ity2009 disambiguates Hispanic, Indian, and Korean names with consistently high precision where the simpler heuristics falter.
- Notably, its advantage over initials-based matching is greatest for Chinese and Korean names, which are especially ambiguous because many researchers share a small set of common surnames.
Implications:
- ORCID as a Tool for Disambiguation:
- ORCID-based linkage proves effective for creating robust labeled datasets for large-scale disambiguation tasks, offering cross-disciplinary coverage that more restricted sources such as NIH-funded researcher profiles lack.
- The study illustrates the scalability of ORCID linkage, positioning it as a practical alternative to labor-intensive manual labeling.
- Dataset Comparison:
- AUT-ORC is particularly valuable for capturing diverse name ambiguities across ethnic groups; AUT-NIH, dominated by English names, skews toward less ambiguous instances.
- Self-citation-based data (AUT-SCT) is useful mainly for evaluating recall rather than precision, making the ORCID-linked dataset essential for a holistic performance assessment.
Conclusion:
The study proposes ORCID linkage as a promising method for constructing labeled datasets that scale effectively while still revealing nuanced differences in disambiguation performance across ethnic name groups. Although ORCID has limitations, such as underrepresentation of some researcher demographics, its growing adoption and public availability offer significant opportunities for improving bibliographic data disambiguation. Because the labeled data are publicly shared, other researchers can use them to develop and validate disambiguation models. Future work should address potential biases in, and the completeness of, ORCID data to strengthen its utility as a general-purpose resource for evaluating author name disambiguation.