- The paper presents WikiHist.html, the first comprehensive dataset providing the full HTML revision history of English Wikipedia, addressing scalability and historical accuracy issues with traditional wikitext parsing methods.
- The dataset was generated using a novel parallelized architecture employing local MediaWiki instances capable of selecting historical template and module versions for accurate HTML representation.
- Empirical analysis shows that the HTML data is crucial for accurate analysis: over 50% of the hyperlinks visible to readers appear only in the HTML, and these HTML-only links are important for user navigation.
The paper "WikiHist.html: English Wikipedia's Full Revision History in HTML Format" presents an innovative dataset that provides the complete revision history of the English Wikipedia in HTML format. This work addresses two significant challenges faced by researchers analyzing Wikipedia: scalability and historical accuracy in the translation of wikitext to HTML.
Background
Wikipedia, a vast collaborative encyclopedia, is an important object of study in computational and social science research. Its content is stored in a markup language called wikitext, which the MediaWiki software converts to HTML when articles are served to readers. During this conversion, templates and modules are expanded, so the rendered HTML contains content that is absent from the raw wikitext. Yet historical revisions of Wikipedia have been released only as wikitext, forcing researchers to perform their own ad hoc and inevitably limited wikitext-to-HTML parsing. This approach has two fundamental drawbacks:
- Scalability: Converting the full revision history of English Wikipedia through the Wikimedia REST API would require one rate-limited network call per revision, which is prohibitively slow and resource-intensive at the scale of hundreds of millions of revisions (see the sketch after this list).
- Historical Accuracy: The API expands templates and modules in their current versions rather than the versions that were in effect when a revision was written, so the resulting HTML may differ from what readers originally saw.
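To make the baseline concrete, here is a minimal sketch of the API-based approach whose limitations are described above: converting a single revision's wikitext to HTML via Wikimedia's public REST API. The endpoint and payload shape follow the documented Parsoid transform route, but they are assumptions of this sketch rather than part of the paper and should be checked against the current API documentation.

```python
import requests

# Minimal sketch (not the authors' pipeline): render one revision's wikitext
# to HTML via Wikimedia's public REST API. Endpoint and payload are based on
# the documented transform route and may need adjusting.
API = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html"

def wikitext_to_html(wikitext: str, title: str = "Sandbox") -> str:
    """Expand templates/modules and render a single wikitext string to HTML."""
    resp = requests.post(f"{API}/{title}", json={"wikitext": wikitext})
    resp.raise_for_status()
    return resp.text

# Doing this for ~580 million historical revisions means hundreds of millions
# of rate-limited network calls, and templates are always expanded in their
# *current* versions -- exactly the two problems the paper sets out to solve.
```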
Contribution
The authors developed a parallelized architecture built on local MediaWiki instances that parses the entire revision history at scale and, for each article revision, expands templates and modules in the versions that were in effect at that revision's time. This overcomes both limitations of the API-based method.
Dataset: WikiHist.html
The WikiHist.html dataset comprises three components:
- HTML Revision History: 580 million revisions of 5.8 million articles, totaling roughly 7 TB in compressed form.
- Page Creation Dates: Supplementary tables with the creation date of every page, which allow researchers to determine whether a link target already existed at a given point in time.
- Redirect History: The history of redirects, needed to resolve links to the correct target article at any point in time.
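As a starting point for working with the HTML revision history, the following sketch iterates over one compressed dump file. The file name, the JSON-lines layout, and the field names ("title", "timestamp", "html") are assumptions made for illustration; the actual schema should be taken from the dataset's documentation.

```python
import gzip
import json

# Hypothetical reader for one compressed WikiHist.html file. The JSON-lines
# layout and field names below are assumptions for illustration only.
def iter_revisions(path: str):
    """Yield one revision record at a time without loading the file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# "wikihist_part_000.jsonl.gz" is a placeholder file name.
for rev in iter_revisions("wikihist_part_000.jsonl.gz"):
    print(rev["title"], rev["timestamp"], len(rev["html"]))
    break  # peek at the first revision only
```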
Advantages of HTML over Wikitext
The empirical analysis in the paper underscores the superior utility of HTML over wikitext, particularly regarding hyperlink analysis:
- Hyperlink Prevalence: Over 50% of the hyperlinks visible in the HTML are absent from the raw wikitext, largely because they are introduced when templates are expanded; analyses that rely on wikitext alone therefore miss a large share of the links readers actually see (a simplified comparison is sketched after this list).
- Navigational Importance: Links that appear only in the HTML are shown to be about as important for reader navigation as links written directly in the wikitext.
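To illustrate the gap, here is a simplified sketch that contrasts the internal links one can extract from raw wikitext with the anchors present in rendered HTML. It deliberately ignores many edge cases (namespaces, section anchors, percent-encoding) and is not the extraction procedure used in the paper.

```python
import re
from html.parser import HTMLParser

def wikitext_links(wikitext: str) -> set[str]:
    """Internal links written directly in wikitext, e.g. [[Target|label]]."""
    return {m.group(1).strip() for m in re.finditer(r"\[\[([^\]|#]+)", wikitext)}

class _AnchorCollector(HTMLParser):
    """Collect /wiki/... targets of <a> tags in rendered HTML."""
    def __init__(self):
        super().__init__()
        self.targets: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/wiki/"):
                self.targets.add(href[len("/wiki/"):])

def html_links(html: str) -> set[str]:
    """Internal links in the rendered HTML, including those added by templates."""
    parser = _AnchorCollector()
    parser.feed(html)
    return parser.targets

# Links contributed only by expanded templates (navboxes, infoboxes, ...)
# appear in html_links(...) but not in wikitext_links(...).
```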
System Architecture
The authors outline an elegant and efficient system architecture designed for large-scale parsing tasks:
- Parallelization: A multi-process design in which each parent process, assigned one CPU core, reads wikitext revisions from disk and feeds child processes, each of which parses with its own local MediaWiki instance (a simplified sketch follows this list).
- Template and Module Versioning: MediaWiki is augmented with code that, for each revision, selects the version of every template and module that was current at the revision's timestamp, so the generated HTML matches what readers saw at the time.
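The following sketch captures the shape of that parent/child layout under the stated assumptions; `parse_with_local_mediawiki` is a stand-in for the authors' modified, version-aware MediaWiki pipeline, not their actual code.

```python
from multiprocessing import Process, Queue

def parse_with_local_mediawiki(wikitext: str, timestamp: str) -> str:
    # Placeholder for the authors' local, version-aware MediaWiki parser,
    # which expands templates/modules as they existed at `timestamp`.
    raise NotImplementedError("stand-in for the local MediaWiki parsing step")

def worker(queue: Queue, out_path: str) -> None:
    """Child process: parse revisions from the queue with a local MediaWiki instance."""
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            item = queue.get()
            if item is None:          # sentinel: no more work
                break
            wikitext, timestamp = item
            out.write(parse_with_local_mediawiki(wikitext, timestamp) + "\n")

def parent(revisions, n_children: int = 4) -> None:
    """Parent process: stream revisions from disk and fan them out to children."""
    queue: Queue = Queue(maxsize=1000)   # bounded so reading never outpaces parsing
    children = [Process(target=worker, args=(queue, f"out_{i}.html"))
                for i in range(n_children)]
    for child in children:
        child.start()
    for wikitext, timestamp in revisions:   # the parent owns disk I/O
        queue.put((wikitext, timestamp))
    for _ in children:
        queue.put(None)                     # one sentinel per child
    for child in children:
        child.join()

# When run as a script, call parent(...) under an `if __name__ == "__main__":`
# guard so the multiprocessing start method works on all platforms.
```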
Conclusion
WikiHist.html gives researchers an accurate and complete picture of Wikipedia as its readers actually experienced it. Because the authors release the parsing architecture's code alongside the dataset, the approach can potentially be applied to other language editions of Wikipedia as well.
This HTML dataset both provides access to the full content that readers saw at any point in history and enables analyses, such as hyperlink network analysis, that wikitext alone cannot support accurately. As such, it is a valuable contribution to the research landscape around Wikipedia.