OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts (2205.01833v2)

Published 4 May 2022 in cs.DL

Abstract: OpenAlex is a new, fully-open scientific knowledge graph (SKG), launched to replace the discontinued Microsoft Academic Graph (MAG). It contains metadata for 209M works (journal articles, books, etc); 213M disambiguated authors; 124k venues (places that host works, such as journals and online repositories); 109k institutions; and 65k Wikidata concepts (linked to works via an automated hierarchical multi-tag classifier). The dataset is fully and freely available via a web-based GUI, a full data dump, and high-volume REST API. The resource is under active development and future work will improve accuracy and coverage of citation information and author/institution parsing and deduplication.

Citations (178)

Summary

  • The paper presents OpenAlex as a fully-open alternative to Microsoft Academic Graph, offering comprehensive indexing of works, authors, venues, institutions, and concepts.
  • It details a heterogeneous directed graph of five entity types, enhancing transparency and interoperability in scholarly metadata.
  • The study demonstrates robust data integration with daily updates and standardized identifiers, fostering improved research evaluation and collaborative access.

OpenAlex: A Fully-Open Index of Scholarly Works

The paper "OpenAlex: A Fully-Open Index of Scholarly Works, Authors, Venues, Institutions, and Concepts" describes OpenAlex, an open-source project launched in response to the retirement of Microsoft Academic Graph (MAG). MAG was a free and widely used scientific knowledge graph (SKG), and its discontinuation raised concerns about the availability of open scholarly metadata. OpenAlex was developed as a drop-in replacement, built on a fully open framework to improve transparency in research evaluation and scholarly navigation.

OpenAlex is structured as a heterogeneous directed graph with five primary entity types: works, authors, venues, institutions, and concepts. Edges between these entities represent scholarly relationships. The paper describes each entity type in detail and emphasizes persistent OpenAlex IDs, which serve as primary keys; interoperability comes from aligning entities with canonical external IDs (CEIDs).
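The graph model described above can be illustrated with a minimal sketch. This is not OpenAlex's actual schema: the class names, relation labels (`has_author`, `affiliated_with`), and all ID values are hypothetical and chosen only to show how a primary key and a set of CEIDs might sit side by side on each node.

```python
from dataclasses import dataclass, field

# The five entity types named in the paper.
ENTITY_TYPES = ("work", "author", "venue", "institution", "concept")


@dataclass
class Entity:
    openalex_id: str   # primary key (all ID values in this sketch are made up)
    entity_type: str   # one of ENTITY_TYPES
    external_ids: dict = field(default_factory=dict)  # CEIDs keyed by scheme


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # openalex_id -> Entity
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add(self, entity):
        if entity.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity.entity_type}")
        self.nodes[entity.openalex_id] = entity

    def link(self, source_id, relation, target_id):
        # Directed edge; both endpoints must already be in the graph.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.edges.append((source_id, relation, target_id))

    def find_by_external_id(self, scheme, value):
        # Look an entity up by a CEID instead of its primary key.
        for entity in self.nodes.values():
            if entity.external_ids.get(scheme) == value:
                return entity
        return None

    def neighbors(self, source_id, relation):
        # Targets of all outgoing edges with the given relation label.
        return [t for s, r, t in self.edges if s == source_id and r == relation]


graph = Graph()
graph.add(Entity("W1", "work", {"doi": "10.0000/example"}))
graph.add(Entity("A1", "author", {"orcid": "0000-0000-0000-0000"}))
graph.add(Entity("I1", "institution", {"ror": "00example0"}))
graph.link("W1", "has_author", "A1")
graph.link("A1", "affiliated_with", "I1")
```

Keeping the CEIDs in a separate mapping rather than using them as keys reflects the point made in the text: not every entity has an external identifier, so the OpenAlex ID is the only key that is always present.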

OpenAlex indexes approximately 209 million works and adds over 50,000 more each day. Work metadata is retrieved from multiple sources, such as Crossref and PubMed. The paper notes that about half of the indexed works have a DOI as their CEID, which indicates how far metadata standardization currently extends within the platform.

OpenAlex records around 213 million authors. Only a small percentage of them have ORCID identifiers, but where one is available the system uses it as a feature for author disambiguation. Affiliations are handled through the authorship object, which connects authors to works.

Venues, defined as places that host works, account for around 124,000 entries. The paper stresses the importance of ISSN-L, an identifier that 90% of venues possess. The platform also uses fingerprinting algorithms to distinguish between different versions of a work.

Institutions are represented through indexed affiliations of authors, totaling around 109,000 entries. The paper underscores the integration of ROR IDs for institutions, presenting an effective mapping strategy for structured and unstructured affiliation data.

Concepts are abstract ideas encapsulated within works, and OpenAlex indexes around 65,000 conceptual entries. The use of Wikidata IDs reflects the platform's commitment to interoperability, while hierarchical arrangements facilitate efficient exploration of subject matter.

OpenAlex distributes its data openly in three ways: data dumps, a REST API, and a web-based GUI. The REST API notably features no rate limits. The source code is available on GitHub, which opens the project to community contributions.
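As a rough illustration of the REST access route, the sketch below only builds request URLs and makes no network calls. The base URL, the endpoint names (one per entity type in the paper), and the `search` and `page` query parameters are assumptions that do not come from this summary; the live service may use different names, so check the current API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL; not stated in this summary.
BASE_URL = "https://api.openalex.org"

# Assumed endpoint names mirroring the paper's five entity types.
ENDPOINTS = {"works", "authors", "venues", "institutions", "concepts"}


def entity_url(endpoint, openalex_id):
    """Build the URL for a single entity, addressed by its OpenAlex ID."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    return f"{BASE_URL}/{endpoint}/{openalex_id}"


def list_url(endpoint, params=None):
    """Build the URL for a list query; params are sorted for stable output."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    url = f"{BASE_URL}/{endpoint}"
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


# "W1" is a made-up ID used only to show the URL shape.
single = entity_url("works", "W1")
listing = list_url("authors", {"search": "curie", "page": 2})
```

The resulting strings could be passed to any HTTP client. Separating URL construction from the request itself keeps the sketch testable offline.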

The paper recognizes several existing limitations, such as the need for enhanced parsing and disambiguation of entities, especially authors and institutions. The absence of metadata related to funding sources is also highlighted as a potential area for development.

In conclusion, OpenAlex emerges as a promising open-source alternative for scholarly metadata, with significant implications for researchers and institutions alike. The project takes an open approach to scholarly infrastructure and remains under active development in areas that matter for academic evaluation and representation. Future directions may include improving data completeness and accuracy and expanding the range of metadata available for analysis.