- The paper presents a comprehensive large-scale database of 2.7 million U.S. newswire articles spanning 1878 to 1977.
- It utilizes advanced deep learning pipelines—OCR, layout recognition, entity disambiguation, and georeferencing—to process and validate vast historical data.
- A contrastively trained neural bi-encoder achieves a 91.5% adjusted Rand index when clustering reproduced articles, and additional models supply topic tags, named entities, and georeferenced origins.
A Comprehensive Overview of "Newswire: A Large-Scale Structured Database of a Century of Historical News"
The paper "Newswire: A Large-Scale Structured Database of a Century of Historical News" by Melissa Dell and colleagues presents an exhaustive account of the creation of a robust and structured dataset comprising U.S. newswire articles published between 1878 and 1977. This paper effectively addresses the broad absence of comprehensive archival data about historical newswire content and provides an intricate explanation of the methodologies employed to obtain, process, and analyze this extensive dataset.
The authors construct a database of 2.7 million unique articles drawn from hundreds of terabytes of raw image scans of local newspapers across the United States, extending training data for historical research beyond what current web-based resources can offer. Through a series of deep learning pipelines for layout recognition, OCR, named entity recognition, georeferencing, and entity disambiguation, the team transcribes and validates millions of articles, identifies duplicated content, and preserves fidelity to the original source materials.
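The authors' pipeline code is not reproduced in this overview; the sketch below only illustrates how such a stage-by-stage pipeline could be wired together. Every function name and the `Article` record are illustrative stubs, not the paper's released implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# --- Placeholder stages -------------------------------------------------
# Each function stands in for a trained model described in the paper
# (layout recognition, OCR, NER, georeferencing). Names and signatures
# are assumptions for illustration only.

def detect_layout(image_path: str) -> List[bytes]:
    """Return cropped article regions from a scanned page (stubbed)."""
    return []

def run_ocr(region: bytes) -> str:
    """Transcribe one article crop (stubbed)."""
    return ""

def recognize_entities(text: str) -> List[Dict]:
    """Tag named entities in the transcript (stubbed)."""
    return []

def georeference(text: str) -> Optional[Dict]:
    """Resolve the dateline to coordinates of the originating town (stubbed)."""
    return None

@dataclass
class Article:
    image_path: str
    text: str = ""
    entities: List[Dict] = field(default_factory=list)
    place: Optional[Dict] = None

def process_scan(image_path: str) -> List[Article]:
    """Chain the stages over one page scan and emit structured article records."""
    articles = []
    for region in detect_layout(image_path):
        text = run_ocr(region)
        articles.append(Article(
            image_path=image_path,
            text=text,
            entities=recognize_entities(text),
            place=georeference(text),
        ))
    return articles
```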
The paper's deduplication method, a customized neural bi-encoder, is of direct interest to researchers who need to detect reproduced content across large, noisy text collections. The reported adjusted Rand index of 91.5% for clustering reproduced articles shows how effective contrastively trained syntactic-similarity models can be on OCR-noisy historical data, making the approach relevant to work in computational linguistics and data deduplication more broadly.
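To make the evaluation concrete, here is a small sketch of embedding-based duplicate detection scored with the adjusted Rand index. It assumes the sentence-transformers and scikit-learn libraries and uses an off-the-shelf encoder checkpoint as a stand-in for the paper's contrastively trained bi-encoder; the articles, gold labels, and distance threshold are toy examples.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_distances

articles = [
    "WASHINGTON -- The Senate today passed the farm relief bill ...",
    "Washington - The Senate passed the farm relief bill today ...",  # noisy reprint of article 0
    "CHICAGO -- Steel workers voted to end the strike ...",
]
gold_labels = [0, 0, 1]  # hand-labelled duplicate groups, used only for evaluation

# Stand-in encoder; the paper trains its own bi-encoder, not used here.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(articles, normalize_embeddings=True)

# Group articles whose embeddings fall under a cosine-distance threshold.
clusterer = AgglomerativeClustering(
    n_clusters=None, metric="precomputed", linkage="single", distance_threshold=0.2
)
pred_labels = clusterer.fit_predict(cosine_distances(embeddings))

print("Adjusted Rand index:", adjusted_rand_score(gold_labels, pred_labels))
```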
Another technical contribution is a neural topic classification model that tags articles with topics such as politics, crime, and labor movements, offering a window into the social fabric of the period. In parallel, an entity disambiguation pipeline enriches the dataset by linking named entities to their corresponding Wikipedia entries, with particular attention to the difficulties of disambiguation in historical contexts; the paper repeatedly underscores how much downstream natural language processing depends on precise entity recognition and linking.
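As a rough illustration of topic tagging, the sketch below runs an off-the-shelf zero-shot classifier over a sample article, with candidate labels drawn from the topics named above. The paper trains its own classifier, which is not reproduced here; the model checkpoint and label set are assumptions for illustration.

```python
from transformers import pipeline

# Off-the-shelf NLI model used as a zero-shot stand-in for the paper's classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

article = ("PITTSBURGH -- Thousands of steel workers walked off the job today "
           "as union leaders pressed demands for an eight-hour day.")
topics = ["politics", "crime", "labor movement", "sports", "weather"]

result = classifier(article, candidate_labels=topics)
print(result["labels"][0], round(result["scores"][0], 3))  # expected top label: "labor movement"
```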
The dataset's rich metadata, from georeferenced origins to tagged topics and named entities, opens significant avenues for further research across many domains. Researchers studying how news dissemination shaped political or cultural climates could use it to build metrics of replication, influence, or bias over time, and its structured form supports analysis of how historical narratives were reproduced, including whatever biases traveled with information as it spread through newswires across geographical regions.
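For instance, a researcher could aggregate the structured metadata by decade and topic, as in the sketch below. The field names (year, topic, state, n_reprints) and the rows are illustrative assumptions, not the dataset's documented schema.

```python
import pandas as pd

# Toy records standing in for structured newswire metadata.
records = pd.DataFrame([
    {"year": 1912, "topic": "politics", "state": "OH", "n_reprints": 48},
    {"year": 1912, "topic": "labor",    "state": "PA", "n_reprints": 23},
    {"year": 1934, "topic": "labor",    "state": "PA", "n_reprints": 61},
    {"year": 1934, "topic": "crime",    "state": "IL", "n_reprints": 17},
])

# Count reprints per decade and topic as a simple measure of replication over time.
records["decade"] = (records["year"] // 10) * 10
by_decade = (records
             .groupby(["decade", "topic"])["n_reprints"]
             .sum()
             .unstack(fill_value=0))
print(by_decade)
```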
Copyright law limits the dataset to pre-1978 content, which illustrates the challenge researchers face in balancing comprehensive historical data collection with legal constraints. Despite this limitation, the Newswire dataset offers a valuable resource for builders of future LLMs who want to draw on historical documentation as training material.
In conclusion, the paper documents a substantial project that bridges historical data collection, modern deep learning, and multidisciplinary applications. The dataset not only enriches language modeling with historical texture but also sets methodological precedents for building other structured historical corpora. Future work could improve OCR quality or transcribe archival segments not yet processed, potentially using emerging LLMs as tools for historical research and new domain-specific applications.