- The paper presents a reproducible, cloud-based pipeline that extracts, harmonizes, and classifies multilingual Wikipedia citations from structured dumps.
- It implements citation template harmonization, deterministic identifier-based classification, and an external lookup that adds identifiers to 38% of potentially scientific citations.
- The scalable, multilingual approach enables cross-cultural citation analysis and reinforces Wikipedia’s integration within the academic open science ecosystem.
Wikipedia has become a crucial platform within the open science ecosystem, yet it remains weakly integrated with academic open science initiatives. The paper "Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia" by Natallia Kokash and Giovanni Colavizza addresses this gap by proposing an open-source software pipeline that extracts, classifies, and disambiguates citations from Wikipedia dumps in a structured and reproducible manner.
Overview
The primary contribution of the paper is the development of a cloud-based pipeline that allows the extraction of comprehensive citation datasets from Wikipedia dumps. Initially, the authors extracted 29.3 million citations from English Wikipedia in May 2020 as a one-off project. Building upon this, they designed a reproducible pipeline demonstrated by extracting 40.6 million citations in February 2023 and 44.7 million citations in February 2024. The pipeline supports 15 European languages, translating and mapping citation templates from different language versions into a uniform English-based structured format.
Citation Extraction and Processing
The pipeline incorporates three main steps:
- Citation template harmonization
- Classification
- Citation identifier lookup
The pipeline retrieves citation templates from Wikipedia XML dumps and harmonizes them into a uniform key-value dataset amenable to further processing. The authors extend an existing MediaWiki parser to support additional frequently used citation templates, improving its ability to map varied citation structures into a common schema.
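The harmonization step described above can be sketched as a pair of lookup tables applied to an already-parsed template. This is a minimal illustration only: the template names, parameter names, the `type_of_citation` key, and the `harmonize` function are assumptions made for this sketch, not the pipeline's actual mapping files or API.

```python
# Illustrative mapping tables. These names are assumptions for the sketch,
# not the mappings shipped with the pipeline.
TEMPLATE_MAP = {
    "literatur": "cite book",    # German
    "ouvrage": "cite book",      # French
    "article": "cite journal",   # French
}
PARAM_MAP = {
    "autor": "author", "auteur": "author",
    "titel": "title", "titre": "title",
    "jahr": "year", "année": "year",
    "périodique": "journal",
}


def harmonize(template_name, params):
    """Map a language-specific template to an English-style key-value record."""
    name = template_name.strip().lower()
    # Unknown template names are passed through unchanged.
    record = {"type_of_citation": TEMPLATE_MAP.get(name, name)}
    for key, value in params.items():
        key_norm = key.strip().lower()
        value = value.strip()
        if value:  # drop parameters that were left empty in the wikitext
            record[PARAM_MAP.get(key_norm, key_norm)] = value
    return record


example = harmonize(
    "Literatur",
    {"Autor": "A. Beispiel", "Titel": "Ein Buch", "Jahr": "1999", "Ort": ""},
)
print(example)
```

Keeping the mapping in plain tables, rather than in code, is one way to make the manual per-language mapping easy to review and extend.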
Classification and Lookup Procedures
The classification step assigns each citation to one of four categories ("book," "journal," "news," or "other") using available identifiers such as DOI, PMC, PMID, and ISBN. URLs are matched against a pre-compiled list of reputable news media domains to identify news citations. The paper emphasizes that this deterministic classification is deliberately conservative, which avoids the biases inherent in the training-set labeling methods of previous studies.
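A rule-based classifier of this kind might look like the sketch below. The field names, the precedence of the rules (journal identifiers before ISBN), and the three-entry `NEWS_DOMAINS` set are assumptions for illustration; the paper's actual rules and domain list may differ.

```python
from urllib.parse import urlparse

# Tiny illustrative stand-in for the pre-compiled news domain list.
NEWS_DOMAINS = {"bbc.co.uk", "nytimes.com", "theguardian.com"}


def classify(citation):
    """Return 'journal', 'book', 'news', or 'other' for a harmonized record."""

    def has(key):
        return bool(str(citation.get(key) or "").strip())

    # Assumed precedence: journal identifiers first, then ISBN.
    if has("doi") or has("pmid") or has("pmc"):
        return "journal"
    if has("isbn"):
        return "book"

    host = (urlparse(str(citation.get("url") or "")).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Match the domain itself or any of its subdomains.
    if host in NEWS_DOMAINS or any(host.endswith("." + d) for d in NEWS_DOMAINS):
        return "news"

    # Conservative default: anything without a decisive signal stays 'other'.
    return "other"
```

Because every rule depends only on fields present in the record, the same input always yields the same label, which is what makes the approach deterministic and reproducible.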
The lookup process augments citations lacking identifiers by querying external citation databases like Crossref and Google Books. A crucial insight from the paper is that a significant proportion (38%) of potentially scientific citations can be augmented with identifiers through this process, highlighting gaps in current citation database coverage.
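As a rough illustration of such a lookup, the sketch below builds a query against the public Crossref REST API (`/works` with the `query.bibliographic` and `rows` parameters) and accepts a candidate DOI only when its title closely matches the citation title. The function names, the similarity measure, and the 0.9 threshold are assumptions for this sketch and are not taken from the paper.

```python
import json
from difflib import SequenceMatcher
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title, author="", rows=3):
    """Build a Crossref bibliographic query URL from citation metadata."""
    query = " ".join(part for part in (title, author) if part)
    return CROSSREF_WORKS + "?" + urlencode(
        {"query.bibliographic": query, "rows": rows}
    )


def best_doi(title, response, threshold=0.9):
    """Return the DOI of the closest-titled candidate, or None if no
    candidate reaches the (assumed) similarity threshold."""
    best, best_score = None, 0.0
    for item in response.get("message", {}).get("items", []):
        for candidate in item.get("title", []):
            score = SequenceMatcher(None, title.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = item.get("DOI"), score
    return best if best_score >= threshold else None


def lookup_doi(title, author=""):
    """Query Crossref over the network and return a matching DOI or None."""
    with urlopen(build_query_url(title, author), timeout=10) as resp:
        return best_doi(title, json.load(resp))
```

Separating URL construction and candidate selection from the network call keeps the matching logic testable offline; a strict threshold trades recall for fewer wrongly attached identifiers.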
Results and Implications
The paper provides a detailed evaluation of the pipeline's performance over different language dumps:
- For English Wikipedia, the overall number of citations grew by about 10% from 2023 to 2024.
- The "sci score," indicating the fraction of journal citations, was calculated at 5%, aligning with previous studies.
- A broader classification, including books, raised the "sci score" to 12.34%.
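The figures above are simple ratios, which the following sketch makes explicit. It assumes the 40.6 million and 44.7 million totals quoted in the overview are the 2023 and 2024 English Wikipedia counts; the category counts in `toy_counts` are invented purely to show the calculation and are not results from the paper.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100 * (new - old) / old


def sci_score(counts, scientific=("journal",)):
    """Share of citations, in percent, that fall in the 'scientific' categories."""
    total = sum(counts.values())
    return 100 * sum(counts.get(k, 0) for k in scientific) / total


# Growth between the two dumps, using the totals quoted in the overview.
growth = pct_change(40.6e6, 44.7e6)
print(f"growth: {growth:.1f}%")

# Made-up counts, used only to illustrate the narrow and broad definitions.
toy_counts = {"journal": 2, "book": 3, "news": 5, "other": 10}
narrow = sci_score(toy_counts)                       # journals only
broad = sci_score(toy_counts, ("journal", "book"))   # journals and books
print(narrow, broad)
```

Widening the set of "scientific" categories can only raise the score, which is why including books moves the reported figure upward.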
The multilingual analysis reveals varying citation practices across languages, with a comprehensive table detailing percentage breakdowns of citation types and reliable sources. The results demonstrate the pipeline’s adaptability to different language structures, facilitating cross-linguistic research on Wikipedia.
Theoretical and Practical Implications
Practically, the open-source nature of the pipeline makes it accessible to researchers and citation consolidation bodies, enabling continuous updates and studies on Wikipedia citations. Theoretically, the dataset can facilitate analyses of citation practices, knowledge dissemination, and the integration of Wikipedia within the academic ecosystem. The manual mapping of language-specific citation templates to a common structure fosters comparative studies on citation practices across cultures and languages.
Future Directions
Future developments could focus on expanding the number of languages supported by the pipeline and further refining the classification algorithms to include additional identifiers. Collaborative integration with other open science platforms could enhance the utility of Wikipedia as a reliable and verifiable source. Additionally, the paper's insights on citation database coverage could stimulate efforts to improve the comprehensiveness and accessibility of citation indices.
Conclusion
This work presents a significant advancement in the integration of Wikipedia within the open science framework by providing a reproducible, scalable tool for citation extraction and analysis. It bridges the gap between Wikipedia and academic research, enhancing verifiability and fostering a more interconnected knowledge ecosystem. The dataset and codebase offer extensive opportunities for future research and collaboration.