
The Pushshift Telegram Dataset (2001.08438v1)

Published 23 Jan 2020 in cs.SI and cs.CY

Abstract: Messaging platforms, especially those with a mobile focus, have become increasingly ubiquitous in society. These mobile messaging platforms can have deceivingly large user bases, and in addition to being a way for people to stay in touch, are often used to organize social movements, as well as a place for extremists and other ne'er-do-well to congregate. In this paper, we present a dataset from one such mobile messaging platform: Telegram. Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users. To the best of our knowledge, our dataset comprises the largest and most complete of its kind. In addition to the raw data, we also provide the source code used to collect it, allowing researchers to run their own data collection instance. We believe the Pushshift Telegram dataset can help researchers from a variety of disciplines interested in studying online social movements, protests, political extremism, and disinformation.

Summary

  • The paper presents the extensive Pushshift Telegram Dataset, comprising 317 million messages from 2.2 million users across 27,801 channels, collected via API using snowball sampling.
  • Analysis reveals significant insights into Telegram usage, including steady channel growth, high message activity (especially during sociopolitical events), and prevalent use of media, mentions, and hashtags.
  • The dataset serves as a crucial resource for studying communication patterns, political mobilization, and the spread of extremist ideologies and disinformation, enabling future research into digital communication structures.

Overview of The Pushshift Telegram Dataset

The paper, authored by Jason Baumgartner, Savvas Zannettou, Megan Squire, and Jeremy Blackburn, presents a large dataset from the Telegram messaging platform, an app increasingly recognized for its dual role as a communication tool for social movements and a gathering point for extremist groups. The dataset is notable for its size and breadth: 317 million messages from 2.2 million unique users across 27,801 channels. That scale makes it a leading resource for researchers studying social movements, information dissemination, political extremism, and disinformation.

Dataset Description and Structure

Telegram facilitates both private conversations and public interactions through its channel feature, which supports one-to-many communication, enabling information dissemination similar to traditional broadcasting. The dataset is structured around these channels and captures not only message content but also extensive metadata related to users and channels. Data is collected through Telegram’s API via Telethon, a Python library. The crawl was seeded with English-language channels associated with cryptocurrency and right-wing extremism and then expanded by snowball sampling on the platform's forwarding feature: when a message in a known channel has been forwarded from another channel, that source channel is added to the crawl. This let the collection grow to nearly 28,000 channels.
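The snowball expansion described above can be sketched as a breadth-first traversal of a forwarding graph. This is a minimal illustration and not the authors' collector: the `forwarded_from` mapping, the channel names, and the `max_channels` cap are all hypothetical stand-ins for what a real crawler would look up through the Telegram API.

```python
from collections import deque


def snowball_channels(seeds, forwarded_from, max_channels=None):
    """Breadth-first snowball expansion over a forwarding graph.

    seeds: iterable of channel names to start from.
    forwarded_from: dict mapping a channel to the set of channels whose
        messages it forwarded (hypothetical; stands in for API lookups).
    max_channels: optional cap on how many channels to visit.
    """
    discovered = []  # channels in the order they were visited
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    while queue:
        if max_channels is not None and len(discovered) >= max_channels:
            break
        channel = queue.popleft()
        discovered.append(channel)
        # Every channel this one forwarded from becomes a new candidate.
        # Sorting keeps the visit order deterministic for the example.
        for source in sorted(forwarded_from.get(channel, ())):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return discovered


# Toy forwarding graph with made-up channel names.
toy_graph = {
    "seed_crypto": {"crypto_news", "trading_tips"},
    "seed_politics": {"crypto_news", "rally_updates"},
    "rally_updates": {"local_chapter"},
}
found = snowball_channels(["seed_crypto", "seed_politics"], toy_graph)
print(found)
```

Starting from two seeds, the traversal reaches four further channels, including one (`local_chapter`) that is two forwarding hops away from any seed. A sample built this way is shaped by the choice of seeds, which matters when interpreting the dataset's coverage.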

Individual messages, channel characteristics, and user metadata form the main components of the dataset. Multimedia attachments were not collected initially; the paper outlines plans to include them in future releases once adequate storage is secured. To improve interoperability and accessibility, the dataset adheres to the FAIR principles (findable, accessible, interoperable, reusable), supporting research through easy access, standard formatting, and comprehensive documentation.

Key Findings from the Dataset

The analysis reveals significant insights into Telegram's user dynamics and functionalities:

  • Channel Growth: Telegram channels saw substantial initial creation following their introduction in 2015, with further growth maintaining a steady trajectory. Channels display varying levels of activity and user participation.
  • Message Characteristics: Analysis of more than 317 million messages shows high activity, especially around key sociopolitical events such as the Hong Kong protests. The dataset also supports analysis of message length, forwarding patterns, and media usage within messages.
  • Media, Mentions, and Hashtags: The dataset shows that over half of the messages contained media, predominantly photos and documents. Mentions and hashtags are both used to engage other users, with mentions significantly more common than hashtags.
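Tallies like the media, mention, and hashtag figures above could be computed along the following lines. This is only a sketch under stated assumptions: the `text` and `media_type` field names are hypothetical and do not reflect the dataset's actual schema, and the regular expressions are a rough approximation of how mentions and hashtags appear in message text.

```python
import re
from collections import Counter

# Rough patterns: an "@" or "#" that does not follow a word character,
# which avoids counting e-mail addresses as mentions.
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def summarize_messages(messages):
    """Count messages that carry media, a mention, or a hashtag.

    Each message is a dict with hypothetical keys "text" and "media_type".
    """
    stats = Counter()
    for msg in messages:
        stats["total"] += 1
        text = msg.get("text") or ""
        if msg.get("media_type"):
            stats["with_media"] += 1
        if MENTION_RE.search(text):
            stats["with_mention"] += 1
        if HASHTAG_RE.search(text):
            stats["with_hashtag"] += 1
    return dict(stats)


# Made-up sample messages for illustration only.
sample = [
    {"text": "Join us at noon @organizer #rally", "media_type": "photo"},
    {"text": "Price update from @trader", "media_type": None},
    {"text": "", "media_type": "document"},
    {"text": "mail me at a@b.com", "media_type": None},
]
summary = summarize_messages(sample)
print(summary)
```

Dividing each count by `summary["total"]` gives the share of messages with each feature, which is the form in which such findings are usually reported.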

Implications and Future Work

The Pushshift Telegram Dataset is poised to substantially enhance the study of communication patterns within digital ecosystems, especially concerning political mobilization and the spread of extremist ideologies. The dataset’s richness provides opportunities to analyze how information propagates and is used in varied sociopolitical contexts, including during periods of civil unrest. Researchers could investigate these dynamics to develop interventions for monitoring and managing misinformation or to understand the impact of digital mobilization.

Moreover, the comprehensive dataset invites further exploration into Telegram’s unique gateway role between direct messaging and open social media interactions. Future research could also leverage the publicly available data collection source code to generate updated datasets continuously or focus on emerging temporal patterns in response to global events.

Through its careful construction and adherence to principles that support wide reuse, the Pushshift Telegram Dataset stands as an essential resource for academic inquiry into modern digital communication structures. The findings anticipate a growing body of analytical work aimed at understanding, and possibly regulating, the virtual social spaces integral to contemporary digital communication and social movement strategy.
