The Pushshift Reddit Dataset (2001.08435v1)

Published 23 Jan 2020 in cs.SI and cs.CY

Abstract: Social media data has become crucial to the advancement of scientific understanding. However, even though it has become ubiquitous, just collecting large-scale social media data involves a high degree of engineering skill and computational resources. In fact, research is oftentimes gated by data engineering problems that must be overcome before analysis can proceed. This has resulted in the recognition of datasets as meaningful research contributions in and of themselves. Reddit, the so-called "front page of the Internet," in particular has been the subject of numerous scientific studies. Although Reddit is relatively open to data acquisition compared to social media platforms like Facebook and Twitter, the technical barriers to acquisition still remain. Thus, Reddit's millions of subreddits, hundreds of millions of users, and hundreds of billions of comments are at the same time relatively accessible, but time-consuming to collect and analyze systematically. In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that since 2015 has collected Reddit data and made it available to researchers. Pushshift's Reddit dataset is updated in real-time, and includes historical data back to Reddit's inception. In addition to monthly dumps, Pushshift provides computational tools to aid in searching, aggregating, and performing exploratory analysis on the entirety of the dataset. The Pushshift Reddit dataset makes it possible for social media researchers to reduce time spent in the data collection, cleaning, and storage phases of their projects.

Summary

The paper introduces a robust framework combining an ingest engine, PostgreSQL, and Elasticsearch for real-time social media data collection.
It details a comprehensive Reddit dataset with 651 million submissions and 5.6 billion comments, enabling granular analysis of online communities.
The paper highlights the dataset's impact on computational social science and discusses future directions in data ethics and scalable research tools.

The Pushshift Reddit Dataset

The paper "The Pushshift Reddit Dataset," authored by Jason Baumgartner et al., describes how the Pushshift platform collects and disseminates social media data, with a focus on its Reddit dataset, a valuable asset for computational social science research. The paper outlines the technical infrastructure behind Pushshift, the dataset's structure, and the tools provided to facilitate research.

Introduction

The introduction situates the paper within the broader context of social media research, emphasizing the importance of large-scale, reliable datasets for understanding socio-technical phenomena. The authors note the challenges in collecting data from platforms like Facebook and Twitter, whose increasingly restrictive data-access policies have given rise to the term "post-API age." The term reflects the difficulty researchers face in obtaining the data their studies need, and it underlines the significance of datasets like Pushshift's that remain accessible and comprehensive.

Pushshift Infrastructure

Pushshift's infrastructure is designed to facilitate real-time data collection, storage, and dissemination. The platform employs multiple backend components, including an ingest engine, a PostgreSQL database, and an Elasticsearch document store. These components work together so that data is collected efficiently and can be queried effectively by researchers.

Ingest Engine: Handles the collection of raw data from various social media sources, particularly Reddit.
PostgreSQL and Elasticsearch: Index and store data, providing robust querying and aggregation capabilities.
API: Enables researchers to access the data without downloading large datasets, thus lowering the technical barriers to entry.

The architecture ensures scalability and flexibility, making it a sustainable tool for long-term social media research.
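As a rough illustration of what API access without bulk downloads can look like, the sketch below only builds a search URL and does not send any request. The base URL, the `submission`/`comment` path segments, and the `q`, `subreddit`, and `size` parameter names are assumptions taken from the publicly documented Pushshift search API rather than from the summary above, and the service's availability and access rules may differ today. The helper `build_search_url` is a hypothetical name introduced here.

```python
from urllib.parse import urlencode

# Assumed base URL of the Pushshift search API (not stated in the summary).
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_search_url(kind, **params):
    """Build a search URL for submissions or comments (hypothetical helper).

    `kind` selects the record type; `params` are passed through as query
    parameters, sorted by name so the resulting URL is deterministic.
    """
    if kind not in ("submission", "comment"):
        raise ValueError("kind must be 'submission' or 'comment'")
    # Drop parameters that were left unset and sort the rest by name.
    items = sorted((k, v) for k, v in params.items() if v is not None)
    return f"{BASE_URL}/{kind}/?{urlencode(items)}"


# Example: comments mentioning "vaccine" in one subreddit, capped at 25 results.
url = build_search_url("comment", q="vaccine", subreddit="science", size=25)
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect and log before any request is issued.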

Dataset Description

The Pushshift Reddit dataset includes 651 million submissions and 5.6 billion comments from 2.88 million subreddits, covering a period from 2005 to 2019. The dataset is divided into submissions and comments, each stored as newline-delimited JSON (one object per line) with fields such as id, author, created_utc, subreddit, and score, among others. This per-record structure allows for granular analysis of user and community behavior on Reddit.
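The newline-delimited JSON layout described above can be processed one record at a time, which matters at this scale because a file never has to fit in memory. The following is a minimal sketch under stated assumptions: the sample records are made up, only the field names listed above are used, and `iter_records` and `summarize` are hypothetical helper names. Decompressing a real dump file is left out.

```python
import io
import json
from collections import Counter
from datetime import datetime, timezone


def iter_records(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(lines):
    """Count records per subreddit and track the earliest created_utc."""
    per_subreddit = Counter()
    earliest = None
    for record in iter_records(lines):
        per_subreddit[record["subreddit"]] += 1
        # int() tolerates timestamps stored either as numbers or as strings.
        created = int(record["created_utc"])
        if earliest is None or created < earliest:
            earliest = created
    return per_subreddit, earliest


# Made-up sample data; a real run would pass an open file object instead.
SAMPLE = io.StringIO(
    '{"id": "a1", "author": "user_one", "created_utc": 1546300800, "subreddit": "science", "score": 12}\n'
    '{"id": "a2", "author": "user_two", "created_utc": 1546304400, "subreddit": "science", "score": 3}\n'
    '{"id": "a3", "author": "user_one", "created_utc": 1546308000, "subreddit": "history", "score": 7}\n'
)

counts, earliest = summarize(SAMPLE)
print(counts.most_common())
print(datetime.fromtimestamp(earliest, tz=timezone.utc).isoformat())
```

Because `summarize` accepts any iterable of lines, the same function works on an in-memory sample, an open text file, or a streaming decompressor.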

Community and Outreach

Pushshift maintains an active community of researchers and users through platforms like Reddit and Slack. This community-driven approach facilitates continuous feedback and improvement of the dataset and tools. The Slackbot and API provide real-time interaction with the dataset, enabling rapid visualization and analysis, which is essential for dynamic research environments.

Use Cases

The paper highlights several use cases where the Pushshift Reddit dataset has already contributed significantly:

Online Community Governance: Analyzing moderation strategies and their effects on user behavior.
Online Extremism: Understanding the spread of extremist ideologies and hate speech.
Online Disinformation: Studying the dissemination of fake news and propaganda.
Web Science: Investigating user engagement, social media dynamics, and technological adoption.
Health Informatics: Researching sensitive topics like mental health and substance abuse through anonymous online discussions.
Robust Intelligence: Enhancing natural language processing, recommendation systems, and intelligent agents using large-scale text data.

Implications and Future Directions

The Pushshift Reddit dataset offers vital contributions to various fields, providing a comprehensive resource for computational social science. However, it also raises important questions about data ethics, privacy, and the future of social media research. As platforms continue to restrict data access, datasets like Pushshift become even more critical.

Future developments could include expanding the dataset to include other social media platforms, improving real-time data collection capabilities, and enhancing tools for data analysis. The authors suggest that maintaining such datasets will require continuous collaboration between researchers, platform providers, and data engineers to navigate the complexities of data privacy and ethical considerations.

Conclusion

The paper "The Pushshift Reddit Dataset" serves as a foundational reference for researchers seeking to use large-scale social media data. Its detailed technical description, together with the demonstrated use cases, illustrates the dataset's potential to advance the understanding of online behavior. With its infrastructure and active community, the Pushshift platform is an important tool in the computational social scientist's toolkit.
