Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research
The study by Henry Tari et al. investigates the use of large language models (LLMs), specifically ChatGPT, for generating synthetic social media datasets that span multiple platforms. These synthetic datasets are intended to mirror real data closely, facilitating research in areas such as disinformation, influence operations, social sensing, hate speech detection, and cyberbullying. Given increasing restrictions on data access driven by privacy, legal, and policy concerns, notably the changes at Twitter (now X), this work is highly relevant.
Methodology
The authors used ChatGPT to generate synthetic counterparts of two original datasets. The first concerns the US 2022 midterm elections and comprises posts from Twitter, Facebook, and Reddit. The second contains social media influencers' posts from TikTok, Instagram, and YouTube. From each platform, 1,000 posts were randomly sampled for the study.
Two distinct prompting strategies were employed for data generation:
- Platform Aware: ChatGPT was explicitly instructed to generate content specific to a given platform.
- Platform Agnostic: ChatGPT was not informed about the specific platform for which the content was being generated, but the prompts were grounded in context-specific examples.
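The contrast between the two strategies can be sketched as simple prompt builders. The wording below is hypothetical and only illustrates the structural difference; the paper's actual prompts are not reproduced here:

```python
# Hypothetical prompt builders illustrating the two strategies.
# The example text and prompt phrasing are assumptions, not the paper's prompts.

def platform_aware_prompt(platform: str, examples: list[str]) -> str:
    """Explicitly names the target platform in the instruction."""
    shots = "\n".join(f"- {e}" for e in examples)
    return (
        f"Generate a new {platform} post about the US 2022 midterm elections.\n"
        f"Match the style of these real {platform} posts:\n{shots}"
    )

def platform_agnostic_prompt(examples: list[str]) -> str:
    """Omits the platform name; style is implied only by the grounding examples."""
    shots = "\n".join(f"- {e}" for e in examples)
    return (
        "Generate a new social media post in the same style as these examples:\n"
        f"{shots}"
    )

aware = platform_aware_prompt("Twitter", ["Vote today! #midterms2022"])
agnostic = platform_agnostic_prompt(["Vote today! #midterms2022"])
```

In the agnostic variant, the model must infer platform conventions (length, hashtags, tone) purely from the in-context examples.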
Evaluation Metrics
The evaluation of the generated datasets was based on several dimensions:
- Lexical Features: The paper assessed the frequency and variety of hashtags, user tags, URLs, and emojis in synthetic datasets compared to real datasets.
- Sentiment Analysis: The sentiment expressed in synthetic posts was compared with that in real posts using a pre-trained RoBERTa-based model for Twitter sentiment analysis.
- Topic Overlap: Topic extraction using BERTopic was employed to identify common and unique topics between real and synthetic datasets across different platforms.
- Embedding Similarity: Embedding vectors for posts were generated and compared using cosine similarity to measure the semantic closeness between real and synthetic datasets.
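The embedding comparison in the last step reduces to cosine similarity between vectors. A minimal sketch, using toy vectors that stand in for real sentence embeddings (which would come from an encoder model):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors; real post embeddings are typically hundreds
# of dimensions and produced by a pre-trained sentence encoder.
real_post = [0.8, 0.1, 0.3]
synthetic_post = [0.7, 0.2, 0.4]
score = cosine_similarity(real_post, synthetic_post)
```

Averaging such scores over matched real/synthetic pairs gives a dataset-level semantic closeness measure.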
Key Findings
Lexical Features: The paper found that synthetic datasets were generally accurate in replicating lexical structures typical of real social media posts, such as emojis and hashtags. However, the synthetic datasets showed less reuse of hashtags and user tags, reflecting weaker coherence among posts on similar topics.
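Hashtag reuse of this kind is easy to quantify. The metric below is a hypothetical illustration (not necessarily the paper's exact measure): the fraction of hashtag occurrences that repeat an already-seen tag.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def reuse_rate(posts: list[str]) -> float:
    """Fraction of hashtag occurrences that repeat a tag already used elsewhere.
    Higher values indicate more hashtag reuse across the dataset."""
    tags = [t.lower() for p in posts for t in HASHTAG.findall(p)]
    if not tags:
        return 0.0
    counts = Counter(tags)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(tags)

# Illustrative example: real posts recycle tags; synthetic posts each invent new ones.
real = ["Go vote! #midterms #vote", "Polls open #midterms", "#vote now"]
synthetic = ["Go vote! #election2022", "Polls open #democracy", "#civicduty now"]
```

Here `reuse_rate(real)` is higher than `reuse_rate(synthetic)`, matching the finding that generated posts share fewer tags.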
Sentiment Analysis: The sentiment analysis revealed that synthetic posts were generally more positive and less negative compared to real posts. This aligns with existing literature that suggests OpenAI's models tend to produce less toxic content.
Topic Overlap: Topics extracted using BERTopic showed that the synthetic and real datasets largely covered similar general subjects. However, the relative prominence of subtopics varied, with synthetic datasets sometimes introducing new yet contextually relevant topics.
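One simple way to quantify such topic overlap, assuming topic sets have already been extracted (e.g. with BERTopic), is the Jaccard index over the two sets. The topic labels below are illustrative, not taken from the paper:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index: |A intersect B| / |A union B|, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical topic labels for real vs. synthetic election-related posts.
real_topics = {"elections", "polling", "candidates", "turnout"}
synthetic_topics = {"elections", "polling", "candidates", "voter registration"}
overlap = jaccard(real_topics, synthetic_topics)
```

A high but imperfect score would mirror the finding: shared general subjects, with the synthetic side introducing some new but contextually plausible topics.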
Embedding Similarity: The embedding analysis showed that synthetic datasets were semantically close to the real ones, although some platforms, such as Reddit and TikTok, exhibited lower similarity. This may be attributable to the greater length and complexity of posts on those platforms.
Implications and Future Directions
This paper demonstrates the potential of using LLMs like ChatGPT to create realistic synthetic social media datasets. Such datasets can enhance reproducibility in social media research, enabling studies that are otherwise hindered by data access restrictions. However, there are areas for improvement:
- Hashtag and Tag Reuse: Future prompts could emphasize shared hashtags and users among examples to better simulate real data coherence.
- Sentiment Balance: Techniques to generate more balanced synthetic sentiments could be beneficial, especially for areas where negative content is critical for analysis.
- Legal and Privacy Concerns: While generating synthetic data bypasses privacy issues, further research is needed to ensure these data do not inadvertently reveal or misuse personal information.
Conclusion
The research by Tari et al. makes significant strides toward creating high-fidelity synthetic social media datasets using ChatGPT. While the paper identifies areas for enhancement, the initial results are promising and suggest that with refined techniques, synthetic data can serve as a practical alternative to real data for various research applications in social media.
Future work should explore the use of different LLMs, include diverse languages, and address privacy and legal concerns comprehensively to fully realize the potential of synthetic datasets in advancing social media research. The continuous evolution of LLMs offers exciting prospects for improving the fidelity and utility of synthetic social media data, thereby supporting more transparent and reproducible scientific inquiries.