OpenScholar-DataStore: Unified Scholarly Data

Updated 25 March 2026

OpenScholar-DataStore is an open-source infrastructure that unifies scholarly records from eight major sources while preserving each source's native schema.
It employs a DuckDB-based 'views-over-Parquet' system to deliver zero-administration, columnar-speed analytics over hundreds of millions of research outputs.
The platform integrates embedding-based ontology alignment and rigorous DOI normalization to ensure semantic enrichment and reliable cross-source citation analytics.

OpenScholar-DataStore is an open-source, large-scale infrastructure designed to unify, organize, and provide programmatic access to scholarly records, supporting advanced analysis, literature search, and integrative science-of-science workflows. This resource aggregates diverse bibliometric, full-text, and metadata information at a scale spanning hundreds of millions of research outputs, with an emphasis on schema preservation, semantic enrichment, robust citation normalization, and compatibility with modern retrieval-augmented machine learning pipelines.

1. System Architecture and Data Organization

At its core, OpenScholar-DataStore (“the DataStore”) is a “views-over-Parquet” system built on DuckDB, consisting of approximately 960 GB of compressed Apache Parquet files and a sub-300 KB DuckDB catalog. The system unifies eight major open scholarly sources—Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref—while preserving their native structures by assigning each to a discrete schema/namespace. No metadata fields from any source are collapsed or discarded, promoting full fidelity and lossless integration (Wilinski, 3 Mar 2026).

This architecture is driven by three practical considerations:

Zero-administration: DuckDB is an embeddable, serverless OLAP engine, requiring only Python (or a binary) for operation, obviating traditional database provisioning.
Columnar speed: The vectorized, columnar execution enables efficient analytics across several hundred million rows for JOINs, aggregations, and filtering.
Direct Parquet access: Each schema directory points directly to Parquet files, eliminating heavyweight imports and minimizing metadata storage overhead.

Data is organized into 22 schemas mapped to source-specific directories and tables, such as s2ag.papers, openalex.works, sciscinet.paper_metrics, and retwatch.retracted_papers. An orchestrating CLI (datalake_cli.py) automates data fetching, format conversion (via PyArrow), DuckDB catalog generation, and materialization of cross-reference tables. The structure is fully documented by a 1200-line machine-readable SCHEMA.md optimized for LLM agents, which enables automatic SQL composition and multi-schema joins.
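
The catalog-generation step described above can be sketched as a small function that emits one DuckDB view per Parquet directory. This is only an illustration, not the code in datalake_cli.py: the function name view_ddl, the root path, and the schema/table directory layout are assumptions, and only the schema and table names come from the text. The statements rely on DuckDB's read_parquet table function, so the catalog stores view definitions rather than data.

```python
from pathlib import PurePosixPath


def view_ddl(root, layout):
    """Build DuckDB DDL that exposes Parquet directories as views.

    `root` and `layout` are hypothetical; the real directory structure
    used by the project may differ.
    """
    statements = []
    for schema, tables in sorted(layout.items()):
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {schema};")
        for table in tables:
            # Each view reads every Parquet file in its table directory.
            pattern = PurePosixPath(root) / schema / table / "*.parquet"
            statements.append(
                f"CREATE OR REPLACE VIEW {schema}.{table} AS "
                f"SELECT * FROM read_parquet('{pattern}');"
            )
    return statements


# Assumed layout: <root>/<schema>/<table>/*.parquet
layout = {
    "openalex": ["works", "topics"],
    "s2ag": ["papers"],
}
ddl = view_ddl("/data/parquet", layout)
print("\n".join(ddl))
```

Executing these statements against a DuckDB connection would register the views; because each view only points at files, the catalog itself can stay very small, which is in line with the sub-300 KB figure reported above.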

The DataStore’s design enables local single-drive deployment or remote querying via DuckDB’s HTTPFS extension, fetching data from HuggingFace Datasets under hf:// URLs for subsets not gated by restrictive licenses (Wilinski, 3 Mar 2026).

2. Data Ingestion, DOI Normalization, and Cross-Source Integration

The ingestion pipeline preserves the native schema of each data provider. For example:

s2ag.papers, s2ag.authors, and s2ag.citations (Semantic Scholar Academic Graph)
openalex.works and openalex.topics (OpenAlex)
sciscinet.paper_metrics (SciSciNet disruption and atypicality metrics)
pwc.papers (Papers with Code)
retwatch.retracted_papers (Retraction Watch)
ros.patent_paper_pairs (Reliance on Science patent citation pairs)
p2p maps preprint DOIs to published DOIs
Crossref is employed for DOI and reference validation/enrichment (Wilinski, 3 Mar 2026)

Throughout, all DOIs undergo canonical normalization: lowercasing and prefix-stripping (removal of “https://doi.org/”), ensuring strict referential integrity across disparate sources. The xref.doi_map view handles all source-specific DOI transformations, while xref.unified_papers materializes a 293-million-row table aggregating all DOI-linked records with 29 columns and six Boolean source flags. These flags (e.g., has_openalex, has_s2ag) indicate provenance and enable fine-grained filtering for analytic reproducibility or source-specific studies. Overlap statistics (e.g., OpenAlex covers 99.67% of unified DOIs, SciSciNet 54.08%, S2AG 45.52%) quantify relative source coverage (Wilinski, 3 Mar 2026).
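
A minimal sketch of the normalization rule stated above (lowercase, then strip the “https://doi.org/” prefix) is shown below. The function name normalize_doi is hypothetical, and trimming surrounding whitespace is an added assumption; the source-specific transformations handled by xref.doi_map are not reproduced here.

```python
# The only prefix named in the text; other resolver forms are not assumed.
DOI_PREFIXES = ("https://doi.org/",)


def normalize_doi(raw):
    """Lowercase a DOI and strip a leading resolver prefix."""
    # Whitespace trimming is an assumption, not stated in the source.
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


print(normalize_doi("https://doi.org/10.1000/ABC.123"))
print(normalize_doi("10.1000/XyZ"))
```

Lowercasing before the prefix check means a prefix written in mixed case is stripped as well, so two sources that spell the same DOI differently end up with the same join key.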

3. Ontology Alignment and Semantic Enrichment

OpenScholar-DataStore implements an embedding-based cross-ontology alignment framework to address heterogeneity in subject taxonomies, particularly the flat OpenAlex topics (4,516 leaves) lacking formal semantics. This process links each topic to up to 13 domain-specific ontologies—MeSH, ChEBI, NCIT, GO, AGROVOC, CSO, DOID, HPO, EDAM, UNESCO Thesaurus, STW, PhySH, MSC2020—encompassing approximately 1.3 million terms (Wilinski, 3 Mar 2026).

For the 10 smaller ontologies (≈291,000 terms), 1024-dimensional BGE-large sentence embeddings are computed for both topic and ontology term labels, with synonym expansion. Approximate nearest-neighbor search is performed via FAISS to identify candidate matches. For the three largest biomedical ontologies (MeSH, ChEBI, NCIT; ≈1.1 million terms), strict string-matching is used to ensure precision in the long tail.
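
The embedding-matching step can be illustrated with a brute-force cosine-similarity search. This is a simplified stand-in, not the pipeline itself: the toy 3-dimensional vectors replace the 1024-dimensional BGE-large embeddings, the exhaustive loop replaces FAISS approximate search, only the single best term per topic is kept, and the names cosine and best_matches are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_matches(topic_vecs, term_vecs, threshold):
    """Return {topic: (term, similarity)} for the best term at or above threshold."""
    matches = {}
    for topic, tvec in topic_vecs.items():
        term, score = max(
            ((name, cosine(tvec, vec)) for name, vec in term_vecs.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            matches[topic] = (term, score)
    return matches


# Invented toy vectors standing in for real label embeddings.
topics = {"topic_a": [1.0, 0.0, 0.0], "topic_b": [0.0, 1.0, 1.0]}
terms = {"term_x": [0.9, 0.1, 0.0], "term_y": [0.0, 0.0, 1.0]}

matches = best_matches(topics, terms, threshold=0.85)
print(matches)
```

With these toy vectors only topic_a clears the 0.85 threshold; lowering the threshold to 0.65 also admits topic_b, which mirrors the trade-off between precision and coverage that a similarity cutoff controls.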

The resulting xref.topic_ontology_map encodes mappings with similarity scores. At the recommended threshold (similarity ≥ 0.85), precision = 0.67, recall = 0.89, and $F_1 \approx 0.77$; relaxing the threshold to 0.65 achieves 99.8% coverage (4,509/4,516 topics, producing 16,150 mappings). Lexical baselines (TF-IDF, BM25, Jaro–Winkler) underperform, with $F_1$ of only 0.71, 0.61, and 0.63, respectively. In manual stratified annotation, precision at similarity ≥ 0.95 is empirically ≈1.00, and almost all exact/high-quality matches above the 0.85 threshold are semantically correct (Wilinski, 3 Mar 2026).
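
As a quick check on how the reported scores relate, $F_1$ is the harmonic mean of precision and recall. The sketch below applies that standard formula to the rounded values quoted above; the function name f1_score is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values as quoted in the text.
f1 = f1_score(0.67, 0.89)
print(round(f1, 3))
```

The rounded inputs give a value of roughly 0.76, which agrees with the reported $F_1 \approx 0.77$ once rounding of the underlying precision and recall is allowed for.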

4. Validation, Quality Control, and Cross-Source Reliability

Validation consists of ten automated checks and additional manual annotation. These include:

Universal DOI format conformity (zero violations over 293 million records)
Source coverage flags matched against actual data
Primary key uniqueness enforcement
OpenAlex ID/format and topic joinability
No orphan topic IDs in the ontology mapping
Patent citation mapping (86% match on 10,000 samples)
Cross-source citation count correlations
Publication year null/invalid rate checks
Manual checks on high-profile retractions
Reproduction of key result counts from analytic vignettes

Pairwise citation count reliability across 121 million overlapping papers in S2AG, OpenAlex, and SciSciNet yields Pearson correlation coefficients of $r = 0.76$–$0.87$. Bland–Altman analysis reveals that disagreement scales with citation magnitude and exposes rare outlier inconsistencies (e.g., papers with 257,887 citations in S2AG but zero in OpenAlex). Manual annotation on 300 topic-ontology pairs confirms that, above the recommended similarity threshold, erroneous semantic assignments are virtually absent (Wilinski, 3 Mar 2026).
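
The two agreement measures named above can be sketched in a few lines. The citation counts below are invented for illustration, and the 1.96-standard-deviation limits of agreement are the conventional Bland–Altman choice rather than a detail confirmed by the source; pearson_r and bland_altman are hypothetical helper names.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) of agreement for paired counts."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    bias = sum(diffs) / n
    # Sample standard deviation of the paired differences.
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Invented citation counts for five papers in two hypothetical sources.
source_a = [0, 3, 10, 50, 200]
source_b = [1, 2, 12, 45, 230]

r = pearson_r(source_a, source_b)
bias, lower, upper = bland_altman(source_a, source_b)
print(r, bias, lower, upper)
```

In this toy data the largest difference comes from the most-cited paper, which is the pattern the Bland–Altman analysis is meant to expose: a high correlation can coexist with disagreement that grows with citation magnitude.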

5. Analytical Applications and Vignettes

The unified architecture enables cross-source analyses that are impossible with any single database. Four example vignettes illustrate the system’s capabilities:

Disruption, Code Adoption, and Ontology Landscape: Joining SciSciNet's disruption index (CD₅), the unified paper table, and the topic-ontology map makes it possible to test whether papers with code (flagged by has_pwc) show different disruption profiles across ontology domains. For example, code-releasing papers (0.048% of the corpus) have mean CD₅ = -0.0005, versus 0.0026 for non-code papers, a subtle but quantifiable signal.
Retraction Profiles and Ontology Enrichment: Joining retracted_papers, paper_metrics, citation data, and the ontology map makes it possible to compute ontology-level retraction enrichment and to identify highly overrepresented domains (e.g., 394× in "AI Applications").
Patent Impact and Multi-Ontology Footprinting: Profiling the 312,929 patent-cited papers reveals substantially higher mean citation counts and field-weighted citation impact (mean citations = 94.3, FWCI = 4.7) than for non-patent-cited papers (mean citations = 16.1, FWCI = 1.5), further stratified by mapped ontologies.
Cross-Source Citation Reliability: Triple-overlap papers allow for comparison of source-specific citation distributions, mean absolute errors, and systematic source biases as a function of citation magnitude (Wilinski, 3 Mar 2026).
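
To make the enrichment figures in the retraction vignette concrete, the sketch below uses one common definition: the retraction rate within a domain divided by the retraction rate across the whole corpus. Whether the source computes enrichment exactly this way is an assumption, the function name enrichment is hypothetical, and the counts are invented.

```python
def enrichment(domain_retracted, domain_total, all_retracted, all_total):
    """Ratio of a domain's retraction rate to the corpus-wide rate."""
    domain_rate = domain_retracted / domain_total
    overall_rate = all_retracted / all_total
    return domain_rate / overall_rate


# Invented counts: 20 retractions among 1,000 papers in one domain,
# against 100 retractions among 1,000,000 papers overall.
ratio = enrichment(20, 1_000, 100, 1_000_000)
print(round(ratio, 1))
```

A ratio above 1 means the domain is overrepresented among retractions relative to its share of the corpus; small domains can produce very large ratios from few retractions, so such values are best read alongside the underlying counts.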

6. Engineering, Documentation, and Programmatic Access

Deployment is streamlined: clone the repository, make sure ≈1 TB of storage is free, and run the master CLI stages in order (“datalake_cli.py download && convert && create_views && materialize_unified_papers && build_embedding_linkage”). The result is a ready-to-query DuckDB database. With DuckDB’s HTTPFS extension, all Parquet files except those under stringent licenses can also be queried remotely as if they were local.

The documentation (SCHEMA.md) exhaustively enumerates every view, column, row count, and recommended join pattern—explicitly designed to be machine-readable for LLM-based text-to-SQL agent compatibility (Wilinski, 3 Mar 2026).

7. Significance, Availability, and Ecosystem Position

OpenScholar-DataStore constitutes a fully open, locally or remotely deployable, richly validated, and semantically interoperable foundation for integrative research in science-of-science, bibliometrics, and retrieval-based scholarly language modeling. Its preservation of source schemas, embedding-based ontology alignment, and documented, referentially normalized structure enable analyses of code adoption, scientific misconduct, patent influence, and citation reliability across the global publication record that are not possible with single-source platforms. All source code, documentation, and the vast majority of the data are publicly available, facilitating reproducibility and extension by the broader research community (Wilinski, 3 Mar 2026).

References (1)

Wilinski (3 Mar 2026). The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment.
