Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia

Published 27 Jun 2024 in cs.DL | (2406.19291v1)

Abstract: Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in a cloud-based setting. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.

Summary

The paper presents a reproducible, cloud-based pipeline that extracts, harmonizes, and classifies multilingual Wikipedia citations from structured dumps.
It implements citation template harmonization, deterministic classification based on identifiers, and an external lookup step that can add identifiers to 38% of potentially scientific citations.
The scalable, multilingual approach enables cross-cultural citation analysis and reinforces Wikipedia’s integration within the academic open science ecosystem.

Wikipedia has become a crucial platform within the open science ecosystem, yet there remains a gap in its integration with academic open science initiatives. The paper "Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia" by Natallia Kokash and Giovanni Colavizza addresses this challenge by proposing an open-source software pipeline designed to extract, classify, and disambiguate citations from Wikipedia dumps in a structured and reproducible manner.

Overview

The primary contribution of the paper is a cloud-based pipeline that extracts comprehensive citation datasets from Wikipedia dumps. The authors first extracted 29.3 million citations from English Wikipedia in May 2020 as a one-off project. Building on this, they designed a reproducible pipeline and demonstrated it by extracting 40.6 million citations in February 2023 and 44.7 million citations in February 2024. The pipeline supports 15 European languages, translating and mapping citation templates from different language versions into a uniform English-based structured format.

Citation Extraction and Processing

The pipeline incorporates three main steps:

Citation template harmonization
Classification
Citation identifier lookup

The harmonization step retrieves citation templates from XML Wikipedia dumps and converts them into a uniform key-value dataset amenable to further processing. To do so, the authors extend an existing MediaWiki parser to support additional frequently used citation templates, which improves the mapping of varied citation structures into a common schema.
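To make the idea of a uniform key-value dataset concrete, the following is a minimal sketch of template harmonization. It assumes the template has already been parsed into a name and a parameter dictionary. The mapping tables `TEMPLATE_MAP` and `PARAM_MAP` and the function `harmonize` are hypothetical illustrations and are not taken from the paper's codebase.

```python
# Hypothetical mapping tables: the entries below only illustrate the idea and
# are not the mappings shipped with the paper's pipeline.
TEMPLATE_MAP = {
    "literatur": "citation",   # example of a German-language template name
    "ouvrage": "cite book",    # example of a French-language template name
}
PARAM_MAP = {
    "titel": "title", "titre": "title",
    "autor": "author", "auteur": "author",
    "jahr": "year", "année": "year",
}


def harmonize(template_name, params):
    """Return one citation as a flat key-value dict in a generic schema."""
    name = template_name.strip().lower()
    # Unknown template names are kept as-is rather than guessed.
    record = {"template": TEMPLATE_MAP.get(name, name)}
    for key, value in params.items():
        key = key.strip().lower()
        # Unknown parameter names are passed through unchanged.
        record[PARAM_MAP.get(key, key)] = value.strip()
    return record


if __name__ == "__main__":
    print(harmonize("Literatur", {"Autor": "A. Muster", "Titel": "Beispiel", "Jahr": "2001"}))
```

Passing unknown names through unchanged keeps the sketch conservative: nothing is dropped, and unmapped keys remain visible for later manual mapping.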

Classification and Lookup Procedures

The classification step assigns each citation to one of four categories, "book," "journal," "news," or "other," using available identifiers such as DOI, PMC, PMID, and ISBN. URLs from a pre-compiled list of reputable news media domains are also used to identify news citations. The authors stress that this deterministic classification is deliberately conservative, which avoids the biases introduced by training set labeling in previous studies.
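A deterministic classifier of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the authors' implementation: the assignment of DOI, PMID, and PMC to "journal" and of ISBN to "book", the order in which rules are checked, and the two entries in `NEWS_DOMAINS` are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Placeholder entries; the paper relies on a pre-compiled list of news domains.
NEWS_DOMAINS = {"bbc.co.uk", "nytimes.com"}


def has_value(citation, key):
    """True if the citation has a non-empty string value for the key."""
    return bool(citation.get(key, "").strip())


def classify(citation):
    """Classify a harmonized citation dict with fixed rules (assumed order)."""
    # Assumption: article-level identifiers indicate a journal citation.
    if any(has_value(citation, key) for key in ("doi", "pmid", "pmc")):
        return "journal"
    # Assumption: an ISBN indicates a book citation.
    if has_value(citation, "isbn"):
        return "book"
    host = urlparse(citation.get("url", "")).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host in NEWS_DOMAINS:
        return "news"
    # Conservative default: anything without clear evidence stays "other".
    return "other"


if __name__ == "__main__":
    print(classify({"doi": "10.1000/xyz"}))
    print(classify({"url": "https://www.bbc.co.uk/news/world"}))
```

Because every rule depends only on fields present in the citation, the same input always yields the same label, which is what makes the approach reproducible across dumps.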

The lookup process augments citations lacking identifiers by querying external citation databases like Crossref and Google Books. A crucial insight from the paper is that a significant proportion (38%) of potentially scientific citations can be augmented with identifiers through this process, highlighting gaps in current citation database coverage.
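As an illustration of the lookup idea, the sketch below queries the public Crossref REST API for a citation that lacks a DOI. It is not the paper's lookup code: the helper names, the choice to take only the top-ranked hit, and the optional `min_score` threshold are assumptions made for this example. The `fetch` parameter is injectable so the logic can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title, author=""):
    """Build a Crossref bibliographic query URL that asks for one result."""
    text = " ".join(part for part in (title, author) if part)
    return CROSSREF_WORKS + "?" + urlencode({"query.bibliographic": text, "rows": 1})


def extract_doi(payload, min_score=0.0):
    """Return the DOI of the top hit, or None if there is no acceptable hit."""
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None
    best = items[0]
    # Assumption: hits below the relevance threshold are treated as no match.
    if best.get("score", 0.0) < min_score:
        return None
    return best.get("DOI")


def lookup_doi(title, author="", fetch=None, min_score=0.0):
    """Look up a DOI; `fetch` maps a URL to a decoded JSON dict (for tests)."""
    url = build_query_url(title, author)
    if fetch is None:
        with urlopen(url, timeout=10) as response:
            payload = json.load(response)
    else:
        payload = fetch(url)
    return extract_doi(payload, min_score)
```

In practice a matching threshold and some comparison of the returned title against the citation would be needed to avoid false matches; those details are left out here.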

Results and Implications

The paper provides a detailed evaluation of the pipeline's performance over different language dumps:

For English Wikipedia, the pipeline identified a 10% increase in the overall number of citations from 2023 to 2024.
The "sci score," indicating the fraction of journal citations, was calculated at 5%, aligning with previous studies.
A broader classification, including books, raised the "sci score" to 12.34%.
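The reported 10% increase can be checked against the extraction totals given in the abstract (40.6 million citations in February 2023 and 44.7 million in February 2024). The short calculation below is only a consistency check on those two figures; the function name is illustrative.

```python
def relative_growth(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Totals in millions of citations, as reported for English Wikipedia.
growth = relative_growth(40.6, 44.7)
print(f"{growth:.1%}")  # prints 10.1%
```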

The multilingual analysis reveals varying citation practices across languages, with a comprehensive table detailing percentage breakdowns of citation types and reliable sources. The results demonstrate the pipeline’s adaptability to different language structures, facilitating cross-linguistic research on Wikipedia.

Theoretical and Practical Implications

Practically, the open-source nature of the pipeline makes it accessible to researchers and citation consolidation bodies, enabling continuous updates and studies on Wikipedia citations. Theoretically, the dataset can facilitate analyses of citation practices, knowledge dissemination, and the integration of Wikipedia within the academic ecosystem. The manual mapping of language-specific citation templates to a common structure fosters comparative studies on citation practices across cultures and languages.

Future Directions

Future developments could focus on expanding the number of languages supported by the pipeline and further refining the classification algorithms to include additional identifiers. Collaborative integration with other open science platforms could enhance the utility of Wikipedia as a reliable and verifiable source. Additionally, the paper's insights on citation database coverage could stimulate efforts to improve the comprehensiveness and accessibility of citation indices.

Conclusion

This work presents a significant advancement in the integration of Wikipedia within the open science framework by providing a reproducible, scalable tool for citation extraction and analysis. It bridges the gap between Wikipedia and academic research, enhancing verifiability and fostering a more interconnected knowledge ecosystem. The dataset and codebase offer extensive opportunities for future research and collaboration.
