- The paper presents WikiHist.html, the first comprehensive dataset providing the full HTML revision history of English Wikipedia, addressing scalability and historical accuracy issues with traditional wikitext parsing methods.
- The dataset was generated using a novel parallelized architecture employing local MediaWiki instances capable of selecting historical template and module versions for accurate HTML representation.
- Empirical analysis shows that the HTML data is crucial for accurate analysis: over 50% of the hyperlinks visible to readers appear only in the HTML, and these HTML-only links are important for user navigation.
The paper "WikiHist.html: English Wikipedia's Full Revision History in HTML Format" presents an innovative dataset that provides the complete revision history of the English Wikipedia in HTML format. This work addresses two significant challenges faced by researchers analyzing Wikipedia: scalability and historical accuracy in the translation of wikitext to HTML.
Background
Wikipedia, a vast collaborative encyclopedia, is an important object of study in computational and social science research. Its content is stored in a markup language called wikitext, which the MediaWiki software converts to HTML when articles are served to readers. During this conversion, templates and modules are expanded, so the rendered HTML contains content that is absent from the raw wikitext. Yet historical revisions of Wikipedia have been released only as wikitext, forcing researchers to perform their own ad hoc and inevitably limited wikitext-to-HTML parsing. This approach has two fundamental drawbacks:
- Scalability: Converting the full revision history of English Wikipedia through the Wikimedia REST API would require one rate-limited network call per revision, which is prohibitively slow and resource-intensive at the scale of hundreds of millions of revisions (see the sketch after this list).
- Historical Accuracy: The API expands templates and modules in their current versions rather than the versions that were in effect when a revision was written, so the resulting HTML may differ from what readers originally saw.
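To make the baseline concrete, here is a minimal sketch of the API-based approach whose limitations are described above: converting a single revision's wikitext to HTML via Wikimedia's public REST API. The endpoint and payload shape follow the documented Parsoid transform route, but they are assumptions of this sketch rather than part of the paper and should be checked against the current API documentation.

```python
import requests

# Minimal sketch (not the authors' pipeline): render one revision's wikitext
# to HTML via Wikimedia's public REST API. Endpoint and payload are based on
# the documented transform route and may need adjusting.
API = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html"

def wikitext_to_html(wikitext: str, title: str = "Sandbox") -> str:
    """Expand templates/modules and render a single wikitext string to HTML."""
    resp = requests.post(f"{API}/{title}", json={"wikitext": wikitext})
    resp.raise_for_status()
    return resp.text

# Doing this for ~580 million historical revisions means hundreds of millions
# of rate-limited network calls, and templates are always expanded in their
# *current* versions -- exactly the two problems the paper sets out to solve.
```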
Contribution
The authors developed a parallelized architecture built on local MediaWiki instances that parses the entire revision history at scale and, for each article revision, expands templates and modules in the versions that were in effect at that revision's time. This overcomes both limitations of the API-based method.
Dataset: WikiHist.html
The WikiHist.html dataset comprises three components:
- HTML Revision History: 580 million revisions of 5.8 million articles, totaling roughly 7 TB in compressed form.
- Page Creation Dates: Supplementary tables with the creation date of every page, which allow researchers to determine whether a link target already existed at a given point in time.
- Redirect History: The history of redirects, needed to resolve links to the correct target article at any point in time.
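As a starting point for working with the HTML revision history, the following sketch iterates over one compressed dump file. The file name, the JSON-lines layout, and the field names ("title", "timestamp", "html") are assumptions made for illustration; the actual schema should be taken from the dataset's documentation.

```python
import gzip
import json

# Hypothetical reader for one compressed WikiHist.html file. The JSON-lines
# layout and field names below are assumptions for illustration only.
def iter_revisions(path: str):
    """Yield one revision record at a time without loading the file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# "wikihist_part_000.jsonl.gz" is a placeholder file name.
for rev in iter_revisions("wikihist_part_000.jsonl.gz"):
    print(rev["title"], rev["timestamp"], len(rev["html"]))
    break  # peek at the first revision only
```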
Advantages of HTML over Wikitext
The empirical analysis in the paper underscores the superior utility of HTML over wikitext, particularly regarding hyperlink analysis:
- Hyperlink Prevalence: Over 50% of the hyperlinks visible in the HTML are absent from the raw wikitext, largely because they are introduced when templates are expanded; analyses that rely on wikitext alone therefore miss a large share of the links readers actually see (a simplified comparison is sketched after this list).
- Navigational Importance: Links that appear only in the HTML are shown to be about as important for reader navigation as links written directly in the wikitext.
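To illustrate the gap, here is a simplified sketch that contrasts the internal links one can extract from raw wikitext with the anchors present in rendered HTML. It deliberately ignores many edge cases (namespaces, section anchors, percent-encoding) and is not the extraction procedure used in the paper.

```python
import re
from html.parser import HTMLParser

def wikitext_links(wikitext: str) -> set[str]:
    """Internal links written directly in wikitext, e.g. [[Target|label]]."""
    return {m.group(1).strip() for m in re.finditer(r"\[\[([^\]|#]+)", wikitext)}

class _AnchorCollector(HTMLParser):
    """Collect /wiki/... targets of <a> tags in rendered HTML."""
    def __init__(self):
        super().__init__()
        self.targets: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/wiki/"):
                self.targets.add(href[len("/wiki/"):])

def html_links(html: str) -> set[str]:
    """Internal links in the rendered HTML, including those added by templates."""
    parser = _AnchorCollector()
    parser.feed(html)
    return parser.targets

# Links contributed only by expanded templates (navboxes, infoboxes, ...)
# appear in html_links(...) but not in wikitext_links(...).
```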
System Architecture
The authors outline an elegant and efficient system architecture designed for large-scale parsing tasks:
- Parallelization: A multi-process design in which each parent process, assigned one CPU core, reads wikitext revisions from disk and feeds child processes, each of which parses with its own local MediaWiki instance (a simplified sketch follows this list).
- Template and Module Versioning: MediaWiki is augmented with code that, for each revision, selects the version of every template and module that was current at the revision's timestamp, so the generated HTML matches what readers saw at the time.
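The following sketch captures the shape of that parent/child layout under the stated assumptions; `parse_with_local_mediawiki` is a stand-in for the authors' modified, version-aware MediaWiki pipeline, not their actual code.

```python
from multiprocessing import Process, Queue

def parse_with_local_mediawiki(wikitext: str, timestamp: str) -> str:
    # Placeholder for the authors' local, version-aware MediaWiki parser,
    # which expands templates/modules as they existed at `timestamp`.
    raise NotImplementedError("stand-in for the local MediaWiki parsing step")

def worker(queue: Queue, out_path: str) -> None:
    """Child process: parse revisions from the queue with a local MediaWiki instance."""
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            item = queue.get()
            if item is None:          # sentinel: no more work
                break
            wikitext, timestamp = item
            out.write(parse_with_local_mediawiki(wikitext, timestamp) + "\n")

def parent(revisions, n_children: int = 4) -> None:
    """Parent process: stream revisions from disk and fan them out to children."""
    queue: Queue = Queue(maxsize=1000)   # bounded so reading never outpaces parsing
    children = [Process(target=worker, args=(queue, f"out_{i}.html"))
                for i in range(n_children)]
    for child in children:
        child.start()
    for wikitext, timestamp in revisions:   # the parent owns disk I/O
        queue.put((wikitext, timestamp))
    for _ in children:
        queue.put(None)                     # one sentinel per child
    for child in children:
        child.join()

# When run as a script, call parent(...) under an `if __name__ == "__main__":`
# guard so the multiprocessing start method works on all platforms.
```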
Conclusion
WikiHist.html gives researchers an accurate and complete picture of Wikipedia as its readers actually experienced it. Because the authors release the parsing architecture's code alongside the dataset, the approach can potentially be applied to other language editions of Wikipedia as well.
This HTML dataset both provides access to the full content that readers saw at any point in history and enables analyses, such as hyperlink network analysis, that wikitext alone cannot support accurately. As such, it is a valuable contribution to the research landscape around Wikipedia.