Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness

Published 16 Sep 2024 in cs.DL and cs.DB | (2409.10633v2)

Abstract: Clarivate's Web of Science (WoS) and Elsevier's Scopus have been for decades the main sources of bibliometric information. Although highly curated, these closed, proprietary databases are largely biased towards English-language publications, underestimating the use of other languages in research dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive, and open-source research information. While already in use by scholars and research institutions, the quality of its metadata is currently being assessed. This paper contributes to this literature by assessing the completeness and accuracy of OpenAlex's metadata related to language, through a comparison with WoS, as well as an in-depth manual validation of a sample of 6,836 articles. Results show that OpenAlex exhibits a far more balanced linguistic coverage than WoS. However, language metadata is not always accurate, which leads OpenAlex to overestimate the place of English while underestimating that of other languages. If used critically, OpenAlex can provide comprehensive and representative analyses of languages used for scholarly publishing. However, more work is needed at infrastructural level to ensure the quality of metadata on language.

Authors (13)

Citations (3)

Summary

The paper demonstrates that OpenAlex misclassifies 14.7% of language metadata, impacting the reliability of non-English article representation.
Methodological comparison using precision, recall, and balanced accuracy metrics shows robust performance for Spanish and Portuguese but weaknesses for Chinese and Korean.
The study emphasizes the need for improved metadata quality to better capture global linguistic diversity in scholarly communication.

Introduction

The paper "Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness" investigates the accuracy and completeness of language metadata in the recently launched OpenAlex database. OpenAlex has positioned itself as an inclusive and comprehensive alternative to established bibliometric databases such as Web of Science (WoS) and Scopus. The study tests that positioning, focusing on how accurately OpenAlex represents non-English publications within the broader scholarly communication context.

Research Questions and Methodology

The researchers address three core research questions:

How does the linguistic coverage of OpenAlex compare to that of WoS?
How accurate is the article-level language labeling in OpenAlex metadata?
What are the sources of language confusion in OpenAlex?

The study drew on OpenAlex data, available as JSON objects, and compared it against WoS. It involved three rounds of manual verification of a stratified sample of articles across different languages. A team of 14 coders manually validated the language of 6,836 articles, identifying discrepancies between OpenAlex's language metadata and each article's actual language. The validated data were then evaluated with precision, recall, and balanced accuracy to quantify how reliable the labels are for each language.
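The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each record is a JSON object carrying a `language` field (as OpenAlex work objects do), and the records, ids, and the helper names `declared_language_counts` and `stratified_sample` are hypothetical.

```python
import json
import random
from collections import Counter, defaultdict

# Hypothetical records shaped like OpenAlex work objects. Only "id" and
# "language" are used here; the ids are made up for illustration.
raw_lines = [
    '{"id": "W1", "language": "en"}',
    '{"id": "W2", "language": "es"}',
    '{"id": "W3", "language": "en"}',
    '{"id": "W4", "language": null}',
    '{"id": "W5", "language": "pt"}',
    '{"id": "W6", "language": "es"}',
]


def declared_language_counts(lines):
    """Count works per declared language; missing values become 'unknown'."""
    counts = Counter()
    for line in lines:
        work = json.loads(line)
        counts[work.get("language") or "unknown"] += 1
    return counts


def stratified_sample(lines, per_language, seed=0):
    """Draw up to `per_language` work ids from each declared-language stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for line in lines:
        work = json.loads(line)
        strata[work.get("language") or "unknown"].append(work["id"])
    return {
        lang: sorted(rng.sample(ids, min(per_language, len(ids))))
        for lang, ids in strata.items()
    }


counts = declared_language_counts(raw_lines)
sample = stratified_sample(raw_lines, per_language=1)
```

Stratifying by declared language ensures that rarely declared languages still contribute enough articles for manual validation, which a simple random draw dominated by English would not guarantee.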

Results

Comparative Analysis

Linguistic Distribution: OpenAlex shows a more varied linguistic landscape than WoS. Only 75% of articles in OpenAlex are declared to be in English, compared to 96% in WoS. Once misclassified labels are corrected, the distribution is even more diverse than the declared metadata suggests.
Accuracy of Language Labels: The study found that 14.7% of the language labels in OpenAlex are false positives. This misclassification primarily stems from articles written in languages other than English being labeled as English. The multiclass confusion matrix showed that Russian (95.5% correct classification) and Spanish (over 90%) articles are well identified. Chinese and Korean articles, however, were frequently misclassified.
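To make the confusion-matrix reading concrete, here is a small sketch under stated assumptions. The (declared, validated) pairs are invented for illustration and are not data from the paper, and the function names are hypothetical.

```python
from collections import Counter

# Made-up (declared, validated) language pairs; NOT data from the paper.
pairs = [
    ("en", "en"), ("en", "en"), ("en", "zh"), ("en", "ko"),
    ("es", "es"), ("es", "es"), ("ru", "ru"), ("zh", "zh"),
]


def confusion_counts(pairs):
    """Multiclass confusion table keyed by (declared, validated)."""
    return Counter(pairs)


def mislabeled_share(pairs):
    """Fraction of sampled articles whose declared language is wrong."""
    wrong = sum(1 for declared, validated in pairs if declared != validated)
    return wrong / len(pairs)


def correct_rate_by_validated(pairs):
    """For each validated language, the fraction of its articles labeled correctly."""
    totals, correct = Counter(), Counter()
    for declared, validated in pairs:
        totals[validated] += 1
        if declared == validated:
            correct[validated] += 1
    return {lang: correct[lang] / totals[lang] for lang in totals}
```

In this toy sample every mislabeled article carries an English label, which mirrors the direction of error reported above: non-English articles absorbed into the English category.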

Evaluation Metrics

Precision and Recall: Chinese exhibits the lowest recall and balanced accuracy. Russian, while showing high precision, also suffers from low recall. Spanish and Portuguese, on the other hand, demonstrate high performance across all metrics, indicating reliable classification.
Linguistic Diversity: Corrected figures suggest that English accounts for 68% of articles, highlighting a substantial presence of non-English articles. Notably, the shares of Russian and Chinese increased markedly upon validation, indicating that more articles are written in these languages than OpenAlex originally declared.
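The metrics and the corrected shares discussed above can be sketched as follows. This assumes a one-vs-rest treatment of each language, with balanced accuracy taken as the mean of recall and specificity, and a simple reweighting of each declared-language stratum by its population size. Both are illustrative assumptions; the paper's exact computation may differ. The pairs and totals are made up, and the function names are hypothetical.

```python
from collections import Counter

# Made-up (declared, validated) pairs and stratum sizes; NOT data from the paper.
pairs = [
    ("en", "en"), ("en", "en"), ("en", "zh"), ("en", "ko"),
    ("es", "es"), ("es", "es"), ("ru", "ru"), ("zh", "zh"),
]
declared_totals = {"en": 800, "es": 100, "ru": 50, "zh": 50}


def one_vs_rest_metrics(pairs, lang):
    """Precision, recall, and balanced accuracy for one language against all others."""
    tp = sum(1 for d, v in pairs if d == lang and v == lang)
    fp = sum(1 for d, v in pairs if d == lang and v != lang)
    fn = sum(1 for d, v in pairs if d != lang and v == lang)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {
        "precision": precision,
        "recall": recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }


def corrected_shares(pairs, declared_totals):
    """Estimate true language shares by scaling each declared stratum's
    validated proportions up to that stratum's population size."""
    sampled = Counter(d for d, _ in pairs)
    estimated = Counter()
    for (declared, validated), n in Counter(pairs).items():
        estimated[validated] += declared_totals[declared] * n / sampled[declared]
    total = sum(estimated.values())
    return {lang: count / total for lang, count in estimated.items()}


zh = one_vs_rest_metrics(pairs, "zh")
shares = corrected_shares(pairs, declared_totals)
```

In the toy data, Chinese has perfect precision but low recall, because half of the articles validated as Chinese were declared as English. That pattern, a label that is trustworthy when present but often missing, is what the text describes as high precision with low recall, and it is why the corrected share of such a language rises after validation.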

Implications

The results underscore the importance of accurate metadata for reliable bibliometric analyses and the dissemination of non-English scholarly works. OpenAlex, despite inaccuracies, provides a more authentic picture of the global linguistic diversity in scholarly communication compared to proprietary, English-centric databases like WoS.

For practical applications, the paper suggests:

Confidence in Data Use: Researchers can confidently use OpenAlex for English and Spanish articles. However, when analyzing languages like Chinese and Korean, results should be triangulated with other sources for accuracy.
Policy and Infrastructure Development: The findings advocate for improved metadata quality at the source, reinforcing the importance of open, interoperable repositories and infrastructures, particularly for minoritized languages.

Future Directions

The study positions itself as an introductory investigation into OpenAlex's linguistic coverage, inviting further research to:

Enhance metadata accuracy by broadening the scope to include a more diverse range of languages.
Investigate the interplay between open-access repositories and metadata completeness.
Address the accuracy of classification and document-type metadata within OpenAlex to improve its utility for bibliometric analysis and policymaking.

The dynamic nature of OpenAlex promises improvements as feedback and community engagement continue. Future contributions to this platform could influence its evolution into a more accurate and trusted resource in the scientific information ecosystem.
