
ORCID-linked labeled data for evaluating author name disambiguation at scale

Published 5 Feb 2021 in cs.DL and cs.IR | (2102.03237v1)

Abstract: How can we evaluate the performance of a disambiguation method implemented on big bibliographic data? This study suggests that the open researcher profile system, ORCID, can be used as an authority source to label name instances at scale. This study demonstrates the potential by evaluating the disambiguation performance of Author-ity2009 (which algorithmically disambiguates author names in MEDLINE) using 3 million name instances that are automatically labeled through linkage to 5 million ORCID researcher profiles. Results show that although ORCID-linked labeled data do not effectively represent the population of name instances in Author-ity2009, they do effectively capture the 'high precision over high recall' performance of Author-ity2009. In addition, ORCID-linked labeled data can provide nuanced details about Author-ity2009's performance when name instances are evaluated within and across ethnicity categories. As ORCID continues to expand to include more researchers, labeled data via ORCID linkage can better represent the population of the whole disambiguated dataset and be updated on a regular basis. This can benefit author name disambiguation researchers and practitioners who need large-scale labeled data but lack resources for manual labeling or access to other authority sources for linkage-based labeling. The ORCID-linked labeled data for Author-ity2009 are publicly available for validation and reuse.

Citations (15)

Summary

  • The paper demonstrates a novel method using over 5 million ORCID profiles to generate large-scale labeled datasets for evaluating author name disambiguation tools like Author-ity2009.
  • Evaluation shows tools like Author-ity2009 achieve high precision (around 0.99), outperforming simple heuristics, and perform well on challenging names across various ethnicities.
  • ORCID linkage provides a scalable, diverse data source for robust evaluation of disambiguation tools, highlighting their strengths and weaknesses for different name types.

The paper "ORCID-linked labeled data for evaluating author name disambiguation at scale" by Jinseok Kim and Jason Owen-Smith addresses the challenge of evaluating author name disambiguation methods in large-scale bibliographic datasets, with a specific focus on leveraging ORCID as an authoritative source for creating labeled data. ORCID, an open researcher and contributor ID system, provides researcher profiles containing authorship information that can be linked to bibliographic records to generate large-scale labeled datasets. This study utilizes over 5 million ORCID profiles to evaluate the disambiguation capabilities of Author-ity2009, a tool used for disambiguating author names in the extensive MEDLINE database.

Methodology

  1. Data Linking and Collection:
    • The paper develops datasets by linking MEDLINE name instances with ORCID and NIH-funded researcher profiles. It also employs self-citation information to generate labeled datasets.
    • Three labeled datasets are constructed:
      • AUT-ORC: Linking Author-ity2009 with ORCID profiles, resulting in 3 million labeled instances.
      • AUT-NIH: Linking Author-ity2009 with NIH PI data, offering 313K labeled instances.
      • AUT-SCT: Using self-citation data to generate more than 6 million instance pairs.
  2. Evaluation Criteria:
    • Clustering Performance: Evaluated using B-Cubed metrics (Recall, Precision, F1) to gauge how well the disambiguation tool clusters name instances correctly.
    • Classification Performance: Metric based on how accurately Author-ity2009 could classify self-citation pairs as matched or non-matched.
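
The B-Cubed metrics above score each name instance individually: per-instance precision is the fraction of instances in its predicted cluster that truly belong to the same author, and per-instance recall is the fraction of its true (same-author) cluster that the predicted cluster recovers; the dataset-level scores are the averages. A minimal sketch (the quadratic scan is for clarity; a production implementation would index instances by cluster):

```python
def b_cubed(pred_clusters, true_clusters):
    """B-Cubed precision, recall, and F1 for a clustering against gold labels.

    pred_clusters, true_clusters: dicts mapping each name-instance id to its
    predicted / true cluster label (same keys in both).
    """
    instances = list(true_clusters)
    p_sum = r_sum = 0.0
    for i in instances:
        # Instances sharing i's predicted cluster and i's true cluster.
        pred_mates = {j for j in instances if pred_clusters[j] == pred_clusters[i]}
        true_mates = {j for j in instances if true_clusters[j] == true_clusters[i]}
        correct = len(pred_mates & true_mates)
        p_sum += correct / len(pred_mates)  # per-instance precision
        r_sum += correct / len(true_mates)  # per-instance recall
    p = p_sum / len(instances)
    r = r_sum / len(instances)
    return p, r, 2 * p * r / (p + r)
```

An over-merged cluster (two authors lumped together) lowers B-Cubed precision; an over-split author (one person scattered across clusters) lowers B-Cubed recall, which matches the paper's "high precision over high recall" characterization of Author-ity2009.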

Results

  • Precision and Recall:
    • Author-ity2009 achieves high precision (~0.99) across datasets, substantially surpassing baseline heuristics such as all-initials-based (AINI) and first-initial-based (FINI) matching.
    • Recall, however, varies across datasets and ethnicity categories; Author-ity2009's advantage over the baseline heuristics is most pronounced for highly ambiguous names.
  • Per-Ethnicity Analysis:
    • Author-ity2009 consistently maintains high precision on Hispanic, Indian, and Korean names, where the simpler heuristics falter.
    • Notably, name groups known to be especially hard to disambiguate, such as Chinese and Korean names, are where Author-ity2009's advantage over the simpler strategies is clearest.
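
The AINI and FINI baselines cluster name instances by a simple name key. A minimal sketch, assuming AINI keys on the surname plus all forename initials and FINI on the surname plus the first initial only (the key format itself is illustrative):

```python
def aini_key(surname, forename):
    """All-initials (AINI) key: surname plus every forename/middle initial."""
    initials = "".join(word[0] for word in forename.split())
    return f"{surname.lower()}_{initials.lower()}"

def fini_key(surname, forename):
    """First-initial (FINI) key: surname plus the first initial only."""
    return f"{surname.lower()}_{forename.split()[0][0].lower()}"
```

All instances sharing a key form one predicted cluster. FINI conflates, e.g., "Kim, Jin Seok" and "Kim, Jaehyun" into a single cluster, which is why such heuristics lose precision on name groups with many shared surnames and short forenames.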

Implications

  • ORCID as a Tool for Disambiguation:
    • ORCID-based linkage proves beneficial in creating robust datasets for large-scale disambiguation tasks. It offers a comprehensive cross-disciplinary coverage compared to more restricted datasets like NIH-funded profiles.
    • This study illustrates the scalability of ORCID linkages, positioning it as a valuable alternative for generating labeled data without the intensive labor involved in manual labeling processes.
  • Dataset Comparison:
    • AUT-ORC is particularly valuable because it captures diverse name ambiguities across ethnicities. In contrast, AUT-NIH skews toward less ambiguous instances due to the prominence of English names.
    • Self-citation-based data (AUT-SCT) are primarily useful for evaluating recall but not precision, making ORCID-linked datasets crucial for holistic performance assessment.
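
The linkage-based labeling itself can be sketched as follows. The record shapes and the (PMID, surname) match key are hypothetical simplifications for illustration; the actual AUT-ORC construction matches ORCID work records to MEDLINE authorships more carefully:

```python
def label_by_orcid(medline_instances, orcid_profiles):
    """Assign ORCID iDs to MEDLINE name instances via article-level linkage.

    medline_instances: list of dicts like {"pmid": ..., "surname": ...}
    orcid_profiles: dict mapping ORCID iD -> {"surname": ..., "pmids": set}
    (Hypothetical shapes; returns {(pmid, surname): orcid_id} labels.)
    """
    labels = {}
    for orcid_id, profile in orcid_profiles.items():
        for inst in medline_instances:
            # A profile claiming the article plus a surname match labels
            # that name instance with the profile's ORCID iD.
            if (inst["pmid"] in profile["pmids"]
                    and inst["surname"].lower() == profile["surname"].lower()):
                labels[(inst["pmid"], inst["surname"])] = orcid_id
    return labels
```

Instances labeled with the same ORCID iD then form a gold cluster against which Author-ity2009's clusters can be scored; unlinked instances are simply left out of the labeled set, which is the source of the representativeness caveat the paper discusses.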

Conclusion

The study proposes ORCID-linkage as a promising method for constructing labeled datasets that not only scale effectively but also provide nuanced details on disambiguation performance at an ethnic level. While ORCID presents certain limitations, such as underrepresentation in specific researcher demographics, its growing adoption and public availability present significant opportunities for enhancing bibliographic data disambiguation. Researchers benefit from the data’s availability, allowing further developments and validations of disambiguation models. Future work should address potential biases and completeness of ORCID data to improve its utility as a universal tool in author name disambiguation.

