
News Deja Vu: Connecting Past and Present with Semantic Search (2406.15593v2)

Published 21 Jun 2024 in cs.CL, econ.GN, and q-fin.EC

Abstract: Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts. For example, hundreds of millions of page scans from historical newspapers have been noisily transcribed. Traditional sparse methods for searching for relevant material in these vast corpora, e.g., with keywords, can be brittle given complex vocabularies and OCR noise. This study introduces News Deja Vu, a novel semantic search tool that leverages transformer LLMs and a bi-encoder approach to identify historical news articles that are most similar to modern news queries. News Deja Vu first recognizes and masks entities, in order to focus on broader parallels rather than the specific named entities being discussed. Then, a contrastively trained, lightweight bi-encoder retrieves historical articles that are most similar semantically to a modern query, illustrating how phenomena that might seem unique to the present have varied historical precedents. Aimed at social scientists, the user-friendly News Deja Vu package is designed to be accessible for those who lack extensive familiarity with deep learning. It works with large text datasets, and we show how it can be deployed to a massive scale corpus of historical, open-source news articles. While human expertise remains important for drawing deeper insights, News Deja Vu provides a powerful tool for exploring parallels in how people have perceived past and present.

Authors (5)
  1. Brevin Franklin (1 paper)
  2. Emily Silcock (7 papers)
  3. Abhishek Arora (12 papers)
  4. Tom Bryan (4 papers)
  5. Melissa Dell (17 papers)
Citations (1)

Summary

Connecting Historical and Modern Contexts with Semantic Search: An Overview of News Deja Vu

The paper "News Deja Vu: Connecting Past and Present with Semantic Search" by Brevin Franklin et al. introduces a semantic search tool designed to help social scientists draw parallels between historical and contemporary events. News Deja Vu uses transformer-based LLMs and a bi-encoder to identify historical news articles that are semantically similar to modern news queries.

Key Methodological Contributions

The core of News Deja Vu is its semantic search mechanism, which draws on modern LLM-based text retrieval techniques. The process can be summarized in two key steps:

  1. Named Entity Recognition (NER) and Masking: The system first detects and masks named entities within both the query and target articles. This step ensures that the retrieval focuses on the broader narrative rather than specific names, places, or organizations, thereby highlighting conceptual rather than superficial similarities.
  2. Contrastive Training of a Bi-Encoder: Using contrastive learning, News Deja Vu's bi-encoder is trained to map semantically similar articles to nearby points in the embedding space. This involves using a combination of historical data and modern news pairs to fine-tune the model for robustness to the OCR noise and language variation found in historical texts.
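The two steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the News Deja Vu implementation: a hypothetical fixed entity list (`KNOWN_ENTITIES`) stands in for the NER model, and a toy bag-of-words vector (`embed`) stands in for the contrastively trained bi-encoder. All function names here are invented for the example.

```python
import math
import re
from collections import Counter

# ASSUMPTION: a fixed list of entity strings stands in for a real NER model.
KNOWN_ENTITIES = ["Woodrow Wilson", "Standard Oil", "Chicago", "Amazon", "Detroit", "Joe Biden"]


def mask_entities(text, entities=KNOWN_ENTITIES, mask_token="[MASK]"):
    """Replace every known entity string with a mask token."""
    # Mask longer names first so a short name never clobbers part of a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(entity), mask_token, text)
    return text


def embed(text):
    """ASSUMPTION: toy bag-of-words vector in place of a trained bi-encoder."""
    tokens = re.findall(r"\[mask\]|[a-z]+", text.lower())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, top_k=1):
    """Mask entities in query and documents, then rank documents by similarity."""
    query_vec = embed(mask_entities(query))
    scored = [(cosine(query_vec, embed(mask_entities(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    historical = [
        "Workers at Standard Oil in Chicago walked out on strike over wages",
        "Woodrow Wilson addressed the crowd about the new tariff law",
    ]
    modern_query = "Workers at Amazon in Detroit walked out on strike over wages"
    for score, doc in retrieve(modern_query, historical, top_k=2):
        print(f"{score:.3f}  {doc}")
```

Because both the query and the first historical article reduce to the same masked text, they match even though they share no named entities, which is the behavior the masking step is meant to encourage.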

Numerical Performance and Evaluation

The paper reports the following results for News Deja Vu's two main components:

  • The custom NER model achieved an F1 score of 90.4, outperforming other models such as RoBERTa-Large fine-tuned on CoNLL03, which scored 77.8.
  • In pairwise classification, the bi-encoder achieved an F1 score of 92.4, higher than existing models such as SBERT MPNet and previous noise-detection models.
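For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for a binary pairwise task (1 means "the two articles match"). The labels are made up for illustration and are not data from the paper. The scores quoted above are on a 0 to 100 scale, while this function returns a value between 0 and 1.

```python
def f1_score(gold, pred):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative labels only: 3 true positives, 1 false positive, 1 false negative.
    gold = [1, 1, 1, 0, 0, 1]
    pred = [1, 1, 0, 0, 1, 1]
    print(f1_score(gold, pred))
```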

These evaluations indicate that News Deja Vu can retrieve semantically relevant historical texts despite the varied language use and significant OCR noise in historical documents.

Practical and Theoretical Implications

Practical Implications:

  • User Accessibility: Designed as an accessible tool for social scientists without deep expertise in machine learning, News Deja Vu is available as a user-friendly package on PyPI. It works with large-scale text datasets, such as the American Stories dataset on Hugging Face, which contains over 430 million historical newspaper articles.
  • Potential Use Cases: By matching modern events with their historical precedents, News Deja Vu could help researchers identify long-term trends, understand societal reactions over time, and draw insights for contemporary policy discussions.

Theoretical Implications:

  • Expanding the Role of NER: The innovative use of a fine-tuned NER model tailored for historical texts opens up new avenues for entity recognition under conditions of noise and varied orthographies, a scenario not extensively explored in existing literature.
  • Advancements in Contrastive Training: Demonstrating the efficacy of contrastive learning in aligning semantically similar texts across disparate time periods, this approach enhances our understanding of embedding space anisotropy and its mitigation through targeted training strategies.
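To make the contrastive objective concrete, here is a minimal sketch of one common formulation, the InfoNCE loss. This is an assumption chosen for illustration: the summary does not say which contrastive loss the authors used, and the vectors below are toy values rather than real embeddings. The loss is low when the anchor is closer to its positive than to the negatives, and high otherwise.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    # Well-separated case: positive nearly parallel to the anchor.
    print(info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
    # Poorly separated case: a negative is closer to the anchor than the positive.
    print(info_nce(anchor, [0.0, 1.0], [[0.9, 0.1]]))
```

Minimizing a loss of this kind over many article pairs pulls matching articles together and pushes non-matching ones apart in the embedding space.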

Future Developments

Looking forward, the authors suggest extending News Deja Vu to support multiple languages by using pre-trained multilingual embeddings such as those from Sentence BERT. Additionally, fine-tuning on machine-translated datasets could enable cross-lingual search, broadening the tool's applicability in global historical research.

Conclusion

News Deja Vu is a methodological advance in semantic search tailored to the needs of social scientists. By retrieving semantically similar historical articles, it lets researchers view contemporary issues in historical context. The reported performance metrics and the practical usability of the News Deja Vu package point to its value as a tool for historical research and analysis. Future work on multilingual support and improved training methods could extend its utility further.
