The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

Published 31 Mar 2020 in cs.DL and cs.DB | (2003.14046v1)

Abstract: The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing these archives as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
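As a rough illustration of why data layout can matter for batch workloads, the sketch below compares a record-by-record layout with a column-by-column layout for a simple query (counting MIME types). It is a toy model, not the paper's experiment and not real WARC, Parquet, or Avro code: the record contents, the length-prefixed encoding, and all function names are assumptions made for this example. The row-oriented reader is assumed to be unable to skip payload bytes, as when records are consumed through a stream decompressor.

```python
import io
import struct

# Toy records: (url, mime_type, payload). Illustrative data only.
RECORDS = [
    ("http://example.org/a", "text/html", b"<html>" + b"x" * 1000 + b"</html>"),
    ("http://example.org/b.png", "image/png", b"\x89PNG" + b"\x00" * 5000),
    ("http://example.org/c", "text/html", b"<html>" + b"y" * 2000 + b"</html>"),
]


def _write_field(buf, data):
    # Each field is stored as a 4-byte big-endian length followed by the bytes.
    buf.write(struct.pack(">I", len(data)))
    buf.write(data)


def _read_field(buf):
    (n,) = struct.unpack(">I", buf.read(4))
    return buf.read(n)


def write_row_oriented(records):
    """Store records one after another, all fields of a record together."""
    buf = io.BytesIO()
    for url, mime, payload in records:
        for field in (url.encode(), mime.encode(), payload):
            _write_field(buf, field)
    return buf.getvalue()


def write_columnar(records):
    """Store each column contiguously; return (blob, index of column -> (offset, length))."""
    columns = {
        "url": [r[0].encode() for r in records],
        "mime": [r[1].encode() for r in records],
        "payload": [r[2] for r in records],
    }
    buf = io.BytesIO()
    index = {}
    for name, values in columns.items():
        start = buf.tell()
        for value in values:
            _write_field(buf, value)
        index[name] = (start, buf.tell() - start)
    return buf.getvalue(), index


def count_mime_row(blob):
    """Count MIME types by reading every record in full; returns (counts, bytes_read)."""
    buf = io.BytesIO(blob)
    counts = {}
    while buf.tell() < len(blob):
        _read_field(buf)                  # url (not needed, but must be read)
        mime = _read_field(buf).decode()
        _read_field(buf)                  # payload (not needed, but must be read)
        counts[mime] = counts.get(mime, 0) + 1
    return counts, buf.tell()


def count_mime_columnar(blob, index):
    """Count MIME types by reading only the 'mime' column; returns (counts, bytes_read)."""
    offset, length = index["mime"]
    buf = io.BytesIO(blob[offset:offset + length])
    counts = {}
    while buf.tell() < length:
        mime = _read_field(buf).decode()
        counts[mime] = counts.get(mime, 0) + 1
    return counts, length


if __name__ == "__main__":
    row_blob = write_row_oriented(RECORDS)
    col_blob, col_index = write_columnar(RECORDS)
    print(count_mime_row(row_blob))
    print(count_mime_columnar(col_blob, col_index))
```

Both readers return the same counts, but the columnar reader touches only the bytes of the one column it needs, while the row-oriented reader passes over every payload. Real formats add compression, schemas, and indexing on top of this, so the sketch says nothing about the magnitude of any speedup.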

Citations (9)

Authors (2)
