Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, Pubmed and Semantic Scholar (2406.15154v1)

Published 21 Jun 2024 in cs.DL

Abstract: This study compares and analyses publication and document types in the following bibliographic databases: OpenAlex, Scopus, Web of Science, Semantic Scholar and PubMed. The results demonstrate that typologies can differ considerably between individual database providers. Moreover, the distinction between research and non-research texts, which is required to identify relevant documents for bibliometric analysis, can vary depending on the data source because publications are classified differently in the respective databases. The focus of this study, in addition to the cross-database comparison, is primarily on the coverage and analysis of the publication and document types contained in OpenAlex, as OpenAlex is becoming increasingly important as a free alternative to established proprietary providers for bibliometric analyses at libraries and universities.

Summary

The paper presents a detailed comparison of document and publication type classifications across OpenAlex, WoS, Scopus, PubMed, and Semantic Scholar.
It reveals significant variability in categorization practices, with proprietary databases providing finer granularity than open platforms.
The findings emphasize the need for standardized classification to improve the reliability of bibliometric analyses and research evaluations.

Analysis of Publication and Document Types in OpenAlex, Web of Science, Scopus, PubMed, and Semantic Scholar

This paper presents a comparative analysis of publication and document types across five prominent bibliographic databases: OpenAlex, Web of Science (WoS), Scopus, PubMed, and Semantic Scholar. It examines the classification practices and typological coverage of each database, and highlights how they differ and what those differences imply for bibliometric research.

Key Findings

The results indicate that the databases differ considerably in how they classify publication and document types. Web of Science and Scopus categorize more granularly than OpenAlex and Semantic Scholar, whose classifications are broader. Notably, OpenAlex, as an open bibliographic database, covers specific publication and document types less fully than proprietary platforms such as WoS and Scopus. PubMed, although specialized and extensive in its categorization scheme, shows less typology overlap with the generalist databases.

Typological Coverage and Classification Practices

OpenAlex: Offers comprehensive but broad coverage with 16 document types and moderate overlap with proprietary databases. It integrates classifications from Crossref but lacks the granularity found in WoS or Scopus.
Scopus and WoS: These databases report near-complete coverage across 18 and 87 document type categories, respectively. They maintain sophisticated, multi-layered classification mechanisms that prove useful in differentiating between multiple research and editorial document types.
PubMed: Primarily biomedical, it maintains 79 distinct document types but often lacks the publication type granularity seen in databases like Scopus. Its use of MeSH terms for classification provides a unique but sometimes restricted comparison base.
Semantic Scholar: Exhibits the least comprehensive coverage with 12 document types and a pronounced deficiency in publication type assignment (only 43.6% coverage), reflecting a significant gap in metadata completeness.
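The coverage figure quoted for Semantic Scholar is the share of records that carry any publication type at all. A minimal sketch of how such a share could be computed is given here, assuming each record is a dictionary with a hypothetical `type` field. The field name and the sample records are illustrative assumptions, not data or code from the paper.

```python
from typing import Iterable, Mapping, Optional


def type_coverage(
    records: Iterable[Mapping[str, Optional[str]]],
    field: str = "type",  # hypothetical field name
) -> float:
    """Return the fraction of records whose `field` is a non-empty string."""
    total = 0
    typed = 0
    for record in records:
        total += 1
        value = record.get(field)
        # Count a record as typed only if the value is present and not blank.
        if value is not None and value.strip():
            typed += 1
    # An empty input has no meaningful coverage; return 0.0 instead of dividing by zero.
    return typed / total if total else 0.0


# Illustrative records only; these are not data from the paper.
sample = [
    {"id": "a", "type": "JournalArticle"},
    {"id": "b", "type": None},
    {"id": "c", "type": ""},
    {"id": "d", "type": "Review"},
]

print(f"{type_coverage(sample):.1%}")  # prints 50.0%
```

Treating empty strings and missing values alike is a design choice: both leave the record unusable for filtering by type.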

Implications for Bibliometric Research

The variability in typologies across these databases poses challenges for cross-database bibliometric analyses. Differences in classification practices affect the calculation of bibliometric indicators, such as citation rates and impact factors. For robust and reproducible results, it is crucial to standardize typologies or accommodate these differences methodologically.

For instance, studies that exclusively focus on research articles may inadvertently include editorials or reviews, leading to skewed metrics. The ability of databases to differentiate between various types of publications influences the reliability of bibliometric studies. Therefore, enhancing the granularity and accuracy of document classification in open databases like OpenAlex and Semantic Scholar becomes imperative.
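One way to accommodate these differences methodologically is to map each database's native labels onto a shared vocabulary before filtering. The sketch below assumes a small hand-written lookup table. The table, the function names, and the decision to count only `article` as research are illustrative assumptions and do not reproduce the mapping used in the paper.

```python
# Hypothetical harmonisation table: (database, native label) -> shared label.
# The entries are illustrative and are not the mapping used in the paper.
HARMONISED = {
    ("openalex", "article"): "article",
    ("openalex", "review"): "review",
    ("openalex", "editorial"): "editorial",
    ("wos", "Article"): "article",
    ("wos", "Review"): "review",
    ("wos", "Editorial Material"): "editorial",
    ("scopus", "Article"): "article",
    ("scopus", "Review"): "review",
    ("scopus", "Editorial"): "editorial",
}

# Assumption: only original articles count as research here.
# A study that also treats reviews as research would add "review".
RESEARCH_TYPES = frozenset({"article"})


def harmonise(database: str, native_type: str) -> str:
    """Map a database-specific label to the shared label, or 'unknown'."""
    return HARMONISED.get((database.lower(), native_type), "unknown")


def is_research(database: str, native_type: str,
                research_types: frozenset = RESEARCH_TYPES) -> bool:
    """Return True if the harmonised label is in the chosen research set."""
    return harmonise(database, native_type) in research_types


records = [
    ("wos", "Article"),
    ("wos", "Editorial Material"),
    ("scopus", "Review"),
    ("openalex", "article"),
    ("openalex", "paratext"),  # not in the table, so it maps to "unknown"
]

kept = [r for r in records if is_research(*r)]
print(kept)  # [('wos', 'Article'), ('openalex', 'article')]
```

Unmapped labels fall through to `unknown` and are excluded, so a gap in the table shows up as missing records and cannot silently inflate the research set.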

Practical and Theoretical Implications

Practical Implications

Data Quality and Accuracy: Improving document type classification, particularly in open databases, can enhance the reliability of bibliometric analyses conducted by libraries and academic institutions.
Standardization Efforts: The research underscores the need for standardized document types across databases. This is particularly pertinent for studies aiming to use multiple data sources to triangulate findings.

Theoretical Implications

Classification Schemes: The paper highlights the importance of refined classification schemes that go beyond simple document-type tags. Including dimensions such as paper types can provide richer metadata.
Epistemic Characteristics: Classification practices should reflect the epistemic and textual characteristics of documents. A move towards semantically enriched metadata could aid in achieving this.

Future Directions in Database Classification

Future improvements may draw on full-text analysis and machine learning to raise classification accuracy. Parsing the full text for semantic markers that indicate a document's type and design could give databases a more detailed and precise classification schema. Integrating authors' self-categorization at the point of submission and enforcing strict editorial guidelines could refine the classification further.
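The idea of detecting document types from textual markers can be illustrated with a deliberately simple rule-based sketch. The marker phrases, labels, and threshold below are hypothetical and chosen only for illustration. The paper does not propose this method, and a production system would more plausibly use a trained classifier.

```python
import re

# Hypothetical marker phrases per document type (illustrative only).
MARKERS = {
    "research_article": ("methods", "results", "we measured", "sample"),
    "review": ("this review", "we review", "literature search", "systematic review"),
    "editorial": ("in this issue", "editorial", "guest editor"),
}


def classify(text: str, min_hits: int = 2) -> str:
    """Return the label whose markers occur most often, or 'unclassified'."""
    lowered = text.lower()
    scores = {}
    for label, phrases in MARKERS.items():
        # Count whole-phrase occurrences; \b avoids matching inside longer words.
        scores[label] = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
    best = max(scores, key=scores.get)
    # Require a minimum number of hits so that weak evidence stays unclassified.
    return best if scores[best] >= min_hits else "unclassified"


print(classify("In this issue, the guest editor introduces five papers."))  # editorial
print(classify("Hello world"))  # unclassified
```

The `min_hits` threshold trades coverage for precision: raising it leaves more documents unclassified but reduces spurious labels from a single incidental phrase.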

Conclusion

The paper provides a comprehensive overview of publication and document type classifications across major bibliometric databases. It shows the need for greater granularity and standardization, especially in open databases such as OpenAlex and Semantic Scholar. Better classification practices would benefit the scientific community by making bibliometric indicators more reliable and meaningful, which in turn supports better research evaluation and policy decisions. Future research should continue to explore methods for document categorization that integrate advances in text analysis and AI-driven classification.
