
ORCID-linked labeled data for evaluating author name disambiguation at scale

Published 5 Feb 2021 in cs.DL and cs.IR | (2102.03237v1)

Abstract: How can we evaluate the performance of a disambiguation method implemented on big bibliographic data? This study suggests that the open researcher profile system, ORCID, can be used as an authority source to label name instances at scale. This study demonstrates the potential by evaluating the disambiguation performance of Author-ity2009 (which algorithmically disambiguates author names in MEDLINE) using 3 million name instances that are automatically labeled through linkage to 5 million ORCID researcher profiles. Results show that although ORCID-linked labeled data do not effectively represent the population of name instances in Author-ity2009, they do effectively capture the 'high precision over high recall' performance of Author-ity2009. In addition, ORCID-linked labeled data can provide nuanced details about Author-ity2009's performance when name instances are evaluated within and across ethnicity categories. As ORCID continues to expand to include more researchers, labeled data via ORCID linkage can better represent the population of the whole disambiguated dataset and be updated on a regular basis. This can benefit author name disambiguation researchers and practitioners who need large-scale labeled data but lack resources for manual labeling or access to other authority sources for linkage-based labeling. The ORCID-linked labeled data for Author-ity2009 are publicly available for validation and reuse.

Citations (15)

Summary

  • The paper demonstrates a novel method using over 5 million ORCID profiles to generate large-scale labeled datasets for evaluating author name disambiguation tools like Author-ity2009.
  • Evaluation shows tools like Author-ity2009 achieve high precision (around 0.99), outperforming simple heuristics, and perform well on challenging names across various ethnicities.
  • ORCID linkage provides a scalable, diverse data source for robust evaluation of disambiguation tools, highlighting their strengths and weaknesses for different name types.

The paper "ORCID-linked labeled data for evaluating author name disambiguation at scale" by Jinseok Kim and Jason Owen-Smith addresses the challenge of evaluating author name disambiguation methods in large-scale bibliographic datasets, with a specific focus on leveraging ORCID as an authoritative source for creating labeled data. ORCID, an open researcher and contributor ID system, provides researcher profiles containing authorship information that can be linked to bibliographic records to generate large-scale labeled datasets. This study utilizes over 5 million ORCID profiles to evaluate the disambiguation capabilities of Author-ity2009, a tool used for disambiguating author names in the extensive MEDLINE database.

Methodology

  1. Data Linking and Collection:
    • The paper develops datasets by linking MEDLINE name instances with ORCID and NIH-funded researcher profiles. It also employs self-citation information to generate labeled datasets.
    • Three labeled datasets are constructed:
      • AUT-ORC: Linking Author-ity2009 with ORCID profiles, resulting in 3 million labeled instances.
      • AUT-NIH: Linking Author-ity2009 with NIH PI data, offering 313K labeled instances.
      • AUT-SCT: Using self-citation data to generate more than 6 million instance pairs.
  2. Evaluation Criteria:
    • Clustering Performance: Evaluated using B-Cubed metrics (Recall, Precision, F1) to gauge how well the disambiguation tool clusters name instances correctly.
    • Classification Performance: Metric based on how accurately Author-ity2009 could classify self-citation pairs as matched or non-matched.
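
The B-Cubed metrics above score each name instance individually: per-instance precision is the fraction of instances in its predicted cluster that truly belong to the same author, and per-instance recall is the fraction of its true (same-author) cluster that the predicted cluster recovers; the dataset-level scores are the averages. A minimal sketch (the quadratic scan is for clarity; a production implementation would index instances by cluster):

```python
def b_cubed(pred_clusters, true_clusters):
    """B-Cubed precision, recall, and F1 for a clustering against gold labels.

    pred_clusters, true_clusters: dicts mapping each name-instance id to its
    predicted / true cluster label (same keys in both).
    """
    instances = list(true_clusters)
    p_sum = r_sum = 0.0
    for i in instances:
        # Instances sharing i's predicted cluster and i's true cluster.
        pred_mates = {j for j in instances if pred_clusters[j] == pred_clusters[i]}
        true_mates = {j for j in instances if true_clusters[j] == true_clusters[i]}
        correct = len(pred_mates & true_mates)
        p_sum += correct / len(pred_mates)  # per-instance precision
        r_sum += correct / len(true_mates)  # per-instance recall
    p = p_sum / len(instances)
    r = r_sum / len(instances)
    return p, r, 2 * p * r / (p + r)
```

An over-merged cluster (two authors lumped together) lowers B-Cubed precision; an over-split author (one person scattered across clusters) lowers B-Cubed recall, which matches the paper's "high precision over high recall" characterization of Author-ity2009.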

Results

  • Precision and Recall:
    • Author-ity2009 achieves high precision (~0.99) across datasets, substantially surpassing baseline heuristics such as all-initials-based (AINI) and first-initial-based (FINI) matching.
    • Recall, however, varies across datasets and ethnicity categories; Author-ity2009's advantage over the baseline heuristics is most pronounced for highly ambiguous names.
  • Per-Ethnicity Analysis:
    • Author-ity2009 consistently maintains high precision on Hispanic, Indian, and Korean names, where the simpler heuristics falter.
    • Notably, name groups known to be especially hard to disambiguate, such as Chinese and Korean names, are where Author-ity2009's advantage over the simpler strategies is clearest.
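
The AINI and FINI baselines cluster name instances by a simple name key. A minimal sketch, assuming AINI keys on the surname plus all forename initials and FINI on the surname plus the first initial only (the key format itself is illustrative):

```python
def aini_key(surname, forename):
    """All-initials (AINI) key: surname plus every forename/middle initial."""
    initials = "".join(word[0] for word in forename.split())
    return f"{surname.lower()}_{initials.lower()}"

def fini_key(surname, forename):
    """First-initial (FINI) key: surname plus the first initial only."""
    return f"{surname.lower()}_{forename.split()[0][0].lower()}"
```

All instances sharing a key form one predicted cluster. FINI conflates, e.g., "Kim, Jin Seok" and "Kim, Jaehyun" into a single cluster, which is why such heuristics lose precision on name groups with many shared surnames and short forenames.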

Implications

  • ORCID as a Tool for Disambiguation:
    • ORCID-based linkage proves beneficial in creating robust datasets for large-scale disambiguation tasks. It offers a comprehensive cross-disciplinary coverage compared to more restricted datasets like NIH-funded profiles.
    • This study illustrates the scalability of ORCID linkages, positioning it as a valuable alternative for generating labeled data without the intensive labor involved in manual labeling processes.
  • Dataset Comparison:
    • AUT-ORC is particularly valuable because it captures diverse name ambiguities across ethnicities. In contrast, AUT-NIH skews toward less ambiguous instances due to the prominence of English names.
    • Self-citation-based data (AUT-SCT) are primarily useful for evaluating recall but not precision, making ORCID-linked datasets crucial for holistic performance assessment.
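
The linkage-based labeling itself can be sketched as follows. The record shapes and the (PMID, surname) match key are hypothetical simplifications for illustration; the actual AUT-ORC construction matches ORCID work records to MEDLINE authorships more carefully:

```python
def label_by_orcid(medline_instances, orcid_profiles):
    """Assign ORCID iDs to MEDLINE name instances via article-level linkage.

    medline_instances: list of dicts like {"pmid": ..., "surname": ...}
    orcid_profiles: dict mapping ORCID iD -> {"surname": ..., "pmids": set}
    (Hypothetical shapes; returns {(pmid, surname): orcid_id} labels.)
    """
    labels = {}
    for orcid_id, profile in orcid_profiles.items():
        for inst in medline_instances:
            # A profile claiming the article plus a surname match labels
            # that name instance with the profile's ORCID iD.
            if (inst["pmid"] in profile["pmids"]
                    and inst["surname"].lower() == profile["surname"].lower()):
                labels[(inst["pmid"], inst["surname"])] = orcid_id
    return labels
```

Instances labeled with the same ORCID iD then form a gold cluster against which Author-ity2009's clusters can be scored; unlinked instances are simply left out of the labeled set, which is the source of the representativeness caveat the paper discusses.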

Conclusion

The study proposes ORCID-linkage as a promising method for constructing labeled datasets that not only scale effectively but also provide nuanced details on disambiguation performance at an ethnic level. While ORCID presents certain limitations, such as underrepresentation in specific researcher demographics, its growing adoption and public availability present significant opportunities for enhancing bibliographic data disambiguation. Researchers benefit from the data’s availability, allowing further developments and validations of disambiguation models. Future work should address potential biases and completeness of ORCID data to improve its utility as a universal tool in author name disambiguation.

