
SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks (2405.10700v1)

Published 17 May 2024 in cs.IR, cs.AI, cs.CL, and cs.CY

Abstract: Diaspora communities are disproportionately impacted by off-the-radar misinformation and often neglected by mainstream fact-checking efforts, creating a critical need to scale-up efforts of nascent fact-checking initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic Dataset Generation to leverage the capabilities of the largest frontier LLMs to train local, specialized LLMs. To the best of our knowledge, SynDy is the first paper utilizing LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy utilizes LLMs and social media queries to automatically generate distantly-supervised, topically-focused datasets with synthetic labels on these three tasks, providing essential tools to scale up human-led fact-checking at a fraction of the cost of human-annotated data. Training on SynDy's generated labels shows improvement over a standard baseline and is not significantly worse compared to training on human labels (which may be infeasible to acquire). SynDy is being integrated into Meedan's chatbot tiplines that are used by over 50 organizations, serve over 230K users annually, and automatically distribute human-written fact-checks via messaging apps such as WhatsApp. SynDy will also be integrated into our deployed Co-Insights toolkit, enabling low-resource organizations to launch tiplines for their communities. Finally, we envision SynDy enabling additional fact-checking tools such as matching new misinformation claims to high-quality explainers on common misinformation topics.


Summary

  • The paper introduces SynDy, a framework that uses large language models to generate synthetic data, reducing reliance on manual annotation.
  • The framework yields a 12.5% improvement over a zero-shot baseline on claim matching (MAP@20) and a Macro-F1 score of 0.863 on claim relationship classification, close to the 0.907 achieved with human labels.
  • SynDy scales misinformation response on underrepresented topics, offering comparable results to human-annotated datasets.

SynDy: A Framework for Synthetic Dataset Generation in Misinformation Tasks

Introduction

Misinformation continues to be a relentless challenge in the digital age, with ramifications extending from political influence to public health crises. Traditional fact-checking methods often struggle to keep up, particularly for emerging topics or niche communities. This is where SynDy, the Synthetic Dynamic Dataset Generation framework, comes into play.

SynDy leverages the capabilities of LLMs to create synthetic datasets for training localized, specialized models. This helps scale up misinformation-mitigation tasks like Claim Matching, Topical Clustering, and Claim Relationship Classification. Let's look at what SynDy is, how it works, and why it's a promising tool for fighting misinformation.

The SynDy Framework

SynDy aims to address a critical gap: the lack of annotated data for emerging and niche topics. Here's a breakdown of how it tackles this:

Dataset Selection

  1. Keyword Generation: For any given topic, SynDy uses an LLM to generate topic-relevant keywords, which are assembled into search queries.
  2. Social Media Queries: These generated queries are used to scrape social media data (like Reddit or Twitter posts) relevant to the topic.
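The two steps above can be sketched as a small pipeline. The prompt wording and query format below are illustrative assumptions for this summary, not reproductions of SynDy's actual prompts or scraping code:

```python
# Sketch of SynDy's dataset-selection stage (Steps 1-2).
# The prompt text and query syntax are hypothetical.

def build_keyword_prompt(topic: str, n: int = 5) -> str:
    """Prompt a frontier LLM to propose search keywords for a topic."""
    return (
        f"List {n} short search keywords that people might use when "
        f"discussing misinformation about: {topic}"
    )

def build_social_queries(keywords: list[str], platform: str = "reddit") -> list[str]:
    """Turn LLM-proposed keywords into platform search queries."""
    return [f"{platform}:{kw.strip().lower().replace(' ', '+')}" for kw in keywords]

if __name__ == "__main__":
    # In practice the keywords would come from the LLM's response to
    # build_keyword_prompt; they are hard-coded here to stay self-contained.
    keywords = ["vaccine side effects", "measles outbreak"]
    print(build_social_queries(keywords))
```

The scraped posts returned by these queries become the raw material for the annotation stage described next.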

Synthetic Dataset Annotation

  1. Claim Extraction: Using LLMs, SynDy extracts claims from the gathered social media posts.
  2. Topical Clustering: Posts are clustered by topic, helping identify larger narrative themes.
  3. Claim Relationship Classification: SynDy identifies relationships between claims, classifying them as either supporting or undermining each other.
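For the relationship-classification step, a minimal sketch looks like the following. The label set and prompt text are assumptions made for illustration; the paper's exact prompts and labels are not reproduced here:

```python
# Illustrative sketch of LLM-based claim relationship classification (Step 3).
# Labels and prompt wording are hypothetical, not SynDy's actual scheme.

CLAIM_RELATION_LABELS = ("support", "undermine", "unrelated")

def relation_prompt(claim_a: str, claim_b: str) -> str:
    """Build a prompt asking an LLM to classify the relation between claims."""
    return (
        f"Classify the relationship between the two claims as one of "
        f"{CLAIM_RELATION_LABELS}.\nA: {claim_a}\nB: {claim_b}"
    )

def parse_relation(llm_output: str) -> str:
    """Map a free-text LLM response onto the fixed label set."""
    text = llm_output.lower()
    for label in CLAIM_RELATION_LABELS:
        if label in text:
            return label
    return "unrelated"  # conservative fallback for unparseable output
```

Parsing free-text LLM output back into a closed label set, as `parse_relation` does, is what makes the resulting labels usable as distant supervision for a smaller classifier.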

These tasks provide a labeled dataset that can be used to train models specifically for misinformation-mitigation tasks.
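The topical-clustering step can be approximated with embedding similarity. The greedy threshold-based grouping below is a stand-in for whatever clustering SynDy actually uses, and the embeddings are stubbed as plain vectors; in a real pipeline they would come from a sentence encoder:

```python
# Minimal sketch of topical clustering via greedy cosine-similarity grouping.
# The algorithm and threshold are illustrative assumptions, not SynDy's method.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_indices)
    labels = []
    for i, vec in enumerate(vectors):
        for cid, (seed, members) in enumerate(clusters):
            if cosine(vec, seed) >= threshold:
                members.append(i)
                labels.append(cid)
                break
        else:
            clusters.append((vec, [i]))
            labels.append(len(clusters) - 1)
    return labels
```

Each resulting cluster corresponds to a candidate narrative theme that a fact-checker could review as a unit.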

Real-World Challenges Addressed

Neglected Topics in Mainstream Misinformation

Many diaspora communities encounter misinformation that is often ignored by mainstream fact-checking. SynDy aims to bridge this gap by creating datasets for these underrepresented topics, allowing smaller organizations to leverage LLMs without the need for extensive, human-annotated data.

Scaling Misinformation Response Efforts

Traditional fact-checking is usually reactive—addressing false claims as they appear. SynDy helps shift this paradigm by clustering claims into larger narrative themes. This enables a proactive approach, where high-quality explainer content and pre-bunking materials can be prepared in advance.

Experimental Results

The framework was tested on real-world datasets to evaluate its efficacy. The results were promising:

  • Claim Matching: SynDy-trained models showed a 12.5% improvement over a zero-shot baseline in MAP@20.
  • Topical Clustering: Models trained on SynDy-generated labels performed similarly to those trained on human-annotated data, with the Perspectrum dataset seeing even better performance.
  • Claim Relationship Classification: While the models trained on synthetic data lagged slightly behind those trained on human data, they still performed robustly with a Macro-F1 score of 0.863 compared to 0.907.
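The claim-matching result above is reported as MAP@20, the mean of per-query average precision truncated at rank 20. As a reference, here is a standard implementation of the metric assuming binary relevance judgments (this is the generic definition, not the paper's evaluation code):

```python
# Mean Average Precision at cutoff K, the metric behind the MAP@20 result.

def average_precision_at_k(ranked_ids, relevant_ids, k=20):
    """Average precision for one query, truncated at rank k."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at each relevant hit
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision_at_k(queries, k=20):
    """queries: list of (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision_at_k(r, g, k) for r, g in queries) / len(queries)
```

Because average precision rewards placing relevant matches near the top of the ranking, a 12.5% MAP@20 gain means SynDy-trained models surface correct claim matches noticeably earlier than the zero-shot baseline.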

Overall, these results indicate that SynDy can effectively substitute for human-annotated data in many cases, providing a feasible and scalable solution for misinformation mitigation.

Implications and Future Directions

The practical implications of SynDy can be summarized as follows:

  • Cost Efficiency: Generating synthetic data is far less expensive and faster than manual annotation, making large-scale fact-checking efforts more feasible.
  • Broader Coverage: By focusing on neglected topics, SynDy helps bring attention to misinformation affecting smaller, often overlooked communities.
  • Scalable Solutions: Integrating SynDy into platforms like Meedan’s tiplines can significantly expand the capabilities of misinformation-response initiatives.

In the future, SynDy could evolve to include more sophisticated tasks and improve its relationship classification algorithms. More realistic evaluation scenarios and real-time application in misinformation detection pipelines could further validate and refine the framework.

Conclusion

SynDy represents a practical and effective framework for generating synthetic datasets aimed at misinformation mitigation. By leveraging LLMs, the framework can tackle tasks like Claim Matching, Topical Clustering, and Claim Relationship Classification, providing essential tools to scale up human-led fact-checking efforts. The promising experimental results indicate that models trained on SynDy-generated labels can perform comparably to those trained on human-annotated data. This makes SynDy a valuable asset for researchers, journalists, and fact-checkers in the ongoing battle against misinformation.

