- The paper presents the extensive Pushshift Telegram Dataset, comprising 317 million messages from 2.2 million users across 27,801 channels, collected via API using snowball sampling.
- Analysis reveals significant insights into Telegram usage, including steady channel growth, high message activity (especially during sociopolitical events), and prevalent use of media, mentions, and hashtags.
- The dataset serves as a crucial resource for studying communication patterns, political mobilization, and the spread of extremist ideologies and disinformation, enabling future research into digital communication structures.
Overview of The Pushshift Telegram Dataset
The paper, authored by Jason Baumgartner, Savvas Zannettou, Megan Squire, and Jeremy Blackburn, presents an extensive dataset from the Telegram messaging platform, an app increasingly recognized for its dual role as a communication tool for social movements and a congregation point for extremist groups. The dataset is notable for its size and breadth, encompassing 317 million messages from 2.2 million unique users across 27,801 channels. This coverage makes it a valuable resource for researchers studying social movements, information dissemination, political extremism, and disinformation.
Dataset Description and Structure
Telegram facilitates both private conversations and public interactions through its channel feature, which supports one-to-many communication, enabling information dissemination similar to traditional broadcasting. The dataset is structured around these channels and captures not only message content but also extensive metadata related to users and channels. The data collection mechanism works through Telegram’s API via Telethon, a Python library. Seeded initially with English-language channels associated with cryptocurrency and right-wing extremism, it employs a snowball sampling technique based on the platform's forwarding feature, allowing it to expand to nearly 28,000 channels.
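The snowball expansion described above can be sketched as a breadth-first traversal over forwarding links. This is a minimal illustration, not the authors' collection code: the function `snowball_sample`, the callback `get_forward_sources`, and the toy channel names are all hypothetical, and the in-memory graph stands in for what a real crawler would obtain by reading each channel's messages through the Telegram API.

```python
from collections import deque

def snowball_sample(seeds, get_forward_sources, max_channels=None):
    """Expand a seed set of channels by following forwarding links.

    get_forward_sources(channel) is a hypothetical callback that returns the
    channels whose messages were forwarded into `channel`.
    """
    discovered = list(dict.fromkeys(seeds))  # de-duplicate, keep seed order
    seen = set(discovered)
    queue = deque(discovered)

    while queue:
        channel = queue.popleft()
        for source in get_forward_sources(channel):
            if source in seen:
                continue
            # Stop once the optional channel budget is exhausted.
            if max_channels is not None and len(seen) >= max_channels:
                return discovered
            seen.add(source)
            discovered.append(source)
            queue.append(source)
    return discovered

# Toy forwarding graph (made-up names): channel -> channels it forwards from.
FORWARDS = {
    "crypto_news": ["market_talk", "alt_coins"],
    "market_talk": ["crypto_news", "macro_daily"],
    "alt_coins": [],
    "macro_daily": ["world_events"],
    "world_events": [],
    "unrelated": ["crypto_news"],
}

def toy_forward_sources(channel):
    return FORWARDS.get(channel, [])

result = snowball_sample(["crypto_news"], toy_forward_sources)
print(result)
```

Because discovery only follows links outward from the seeds, a channel such as `unrelated`, which forwards from the seed set but is never forwarded into it, is not reached. This is one reason the choice of seed channels shapes what a snowball sample contains.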
Individual messages, channel characteristics, and metadata about users form the main components of the dataset. Multimedia attachments were not collected initially, but the paper outlines plans to include them in future releases once adequate storage is secured. For improved interoperability and accessibility, the dataset adheres to the FAIR principles (findable, accessible, interoperable, reusable), facilitating research through easy access, standard formatting, and comprehensive documentation.
Key Findings from the Dataset
The analysis reveals significant insights into Telegram's user dynamics and functionalities:
- Channel Growth: Telegram channels saw substantial initial creation following their introduction in 2015, with further growth maintaining a steady trajectory. Channels display varying levels of activity and user participation.
- Message Characteristics: Across the more than 317 million messages, activity is high and rises around key sociopolitical events such as the Hong Kong protests. The analysis also covers message length, forwarding patterns, and media usage within messages.
- Media, Mentions, and Hashtags: Over half of the messages contain media, predominantly photos and documents. Mentions and hashtags are another focus of the analysis as indicators of how users engage with one another, with mentions significantly more common than hashtags.
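Statistics of the kind listed above (share of messages with media, and mention and hashtag frequencies) can be computed with a single pass over message records. The sketch below assumes a simplified, hypothetical record layout with `text` and `media_type` keys; the actual dataset schema is not described here and may differ. The regular expressions are likewise an approximation of how mentions and hashtags appear in message text.

```python
import re
from collections import Counter

# Approximate patterns: an @ or # that is not preceded by a word character.
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")

def summarize(messages):
    """Count media usage, mentions, and hashtags in an iterable of dicts."""
    total = 0
    with_media = 0
    media_types = Counter()
    mentions = Counter()
    hashtags = Counter()

    for msg in messages:
        total += 1
        media = msg.get("media_type")  # hypothetical field name
        if media:
            with_media += 1
            media_types[media] += 1
        text = msg.get("text") or ""   # hypothetical field name
        mentions.update(m.lower() for m in MENTION_RE.findall(text))
        hashtags.update(h.lower() for h in HASHTAG_RE.findall(text))

    return {
        "total": total,
        "media_share": with_media / total if total else 0.0,
        "media_types": media_types,
        "mention_count": sum(mentions.values()),
        "hashtag_count": sum(hashtags.values()),
        "top_mentions": mentions.most_common(3),
    }

# Made-up sample records for illustration only.
SAMPLE = [
    {"text": "Rally at noon, follow @cityupdates #protest", "media_type": "photo"},
    {"text": "Forwarded report from @cityupdates", "media_type": "document"},
    {"text": "Thanks @Moderator01 for the link", "media_type": None},
    {"text": "", "media_type": "photo"},
]

stats = summarize(SAMPLE)
print(stats["media_share"], stats["mention_count"], stats["hashtag_count"])
```

Since `summarize` consumes any iterable, the same function could be fed records streamed from disk one at a time, which matters at the scale of hundreds of millions of messages.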
Implications and Future Work
The Pushshift Telegram Dataset is poised to substantially enhance the study of communication patterns within digital ecosystems, especially concerning political mobilization and the spread of extremist ideologies. The dataset’s richness provides opportunities to analyze how information propagates and is used in varied sociopolitical contexts, including during periods of civil unrest. Researchers could investigate these dynamics to develop interventions for monitoring and managing misinformation or to understand the impact of digital mobilization.
Moreover, the comprehensive dataset invites further exploration into Telegram’s unique gateway role between direct messaging and open social media interactions. Future research could also leverage the publicly available data collection source code to generate updated datasets continuously or focus on emerging temporal patterns in response to global events.
Through its careful construction and adherence to principles that support widespread reuse, the Pushshift Telegram Dataset stands as an essential resource for academic inquiry into modern digital communication structures and a significant contribution by the authors. The findings point toward a growing body of analytical work aimed at understanding, and possibly regulating, the virtual social spaces that are integral to contemporary digital communication and social movement strategy.