
SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks (2405.10700v1)

Published 17 May 2024 in cs.IR, cs.AI, cs.CL, and cs.CY

Abstract: Diaspora communities are disproportionately impacted by off-the-radar misinformation and often neglected by mainstream fact-checking efforts, creating a critical need to scale-up efforts of nascent fact-checking initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic Dataset Generation to leverage the capabilities of the largest frontier LLMs to train local, specialized LLMs. To the best of our knowledge, SynDy is the first paper utilizing LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy utilizes LLMs and social media queries to automatically generate distantly-supervised, topically-focused datasets with synthetic labels on these three tasks, providing essential tools to scale up human-led fact-checking at a fraction of the cost of human-annotated data. Training on SynDy's generated labels shows improvement over a standard baseline and is not significantly worse compared to training on human labels (which may be infeasible to acquire). SynDy is being integrated into Meedan's chatbot tiplines that are used by over 50 organizations, serve over 230K users annually, and automatically distribute human-written fact-checks via messaging apps such as WhatsApp. SynDy will also be integrated into our deployed Co-Insights toolkit, enabling low-resource organizations to launch tiplines for their communities. Finally, we envision SynDy enabling additional fact-checking tools such as matching new misinformation claims to high-quality explainers on common misinformation topics.


Summary

  • The paper introduces SynDy, a framework that uses large language models to generate synthetic data, reducing reliance on manual annotation.
  • The framework yields a 12.5% improvement over a zero-shot baseline on claim matching (MAP@20) and a Macro-F1 score of 0.863 on claim relationship classification, close to the 0.907 achieved with human labels.
  • SynDy scales misinformation response on underrepresented topics, offering comparable results to human-annotated datasets.

SynDy: A Framework for Synthetic Dataset Generation in Misinformation Tasks

Introduction

Misinformation continues to be a relentless challenge in the digital age, with ramifications extending from political influence to public health crises. Traditional fact-checking methods often struggle to keep up, particularly for emerging topics or niche communities. This is where SynDy, the Synthetic Dynamic Dataset Generation framework, comes into play.

SynDy leverages the capabilities of LLMs to create synthetic datasets for training localized, specialized models. This helps scale up misinformation-mitigation tasks like Claim Matching, Topical Clustering, and Claim Relationship Classification. Let's look at what SynDy is, how it works, and why it's a promising tool for fighting misinformation.

The SynDy Framework

SynDy aims to address a critical gap: the lack of annotated data for emerging and niche topics. Here's a breakdown of how it tackles this:

Dataset Selection

  1. Keyword Generation: For any given topic, SynDy uses an LLM to generate topic-relevant keywords, which are assembled into search queries.
  2. Social Media Queries: These generated queries are used to scrape social media data (like Reddit or Twitter posts) relevant to the topic.
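The two steps above can be sketched as a small pipeline. The prompt wording and query format below are illustrative assumptions for this summary, not reproductions of SynDy's actual prompts or scraping code:

```python
# Sketch of SynDy's dataset-selection stage (Steps 1-2).
# The prompt text and query syntax are hypothetical.

def build_keyword_prompt(topic: str, n: int = 5) -> str:
    """Prompt a frontier LLM to propose search keywords for a topic."""
    return (
        f"List {n} short search keywords that people might use when "
        f"discussing misinformation about: {topic}"
    )

def build_social_queries(keywords: list[str], platform: str = "reddit") -> list[str]:
    """Turn LLM-proposed keywords into platform search queries."""
    return [f"{platform}:{kw.strip().lower().replace(' ', '+')}" for kw in keywords]

if __name__ == "__main__":
    # In practice the keywords would come from the LLM's response to
    # build_keyword_prompt; they are hard-coded here to stay self-contained.
    keywords = ["vaccine side effects", "measles outbreak"]
    print(build_social_queries(keywords))
```

The scraped posts returned by these queries become the raw material for the annotation stage described next.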

Synthetic Dataset Annotation

  1. Claim Extraction: Using LLMs, SynDy extracts claims from the gathered social media posts.
  2. Topical Clustering: Posts are clustered by topic, helping identify larger narrative themes.
  3. Claim Relationship Classification: SynDy identifies relationships between claims, classifying them as either supporting or undermining each other.
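For the relationship-classification step, a minimal sketch looks like the following. The label set and prompt text are assumptions made for illustration; the paper's exact prompts and labels are not reproduced here:

```python
# Illustrative sketch of LLM-based claim relationship classification (Step 3).
# Labels and prompt wording are hypothetical, not SynDy's actual scheme.

CLAIM_RELATION_LABELS = ("support", "undermine", "unrelated")

def relation_prompt(claim_a: str, claim_b: str) -> str:
    """Build a prompt asking an LLM to classify the relation between claims."""
    return (
        f"Classify the relationship between the two claims as one of "
        f"{CLAIM_RELATION_LABELS}.\nA: {claim_a}\nB: {claim_b}"
    )

def parse_relation(llm_output: str) -> str:
    """Map a free-text LLM response onto the fixed label set."""
    text = llm_output.lower()
    for label in CLAIM_RELATION_LABELS:
        if label in text:
            return label
    return "unrelated"  # conservative fallback for unparseable output
```

Parsing free-text LLM output back into a closed label set, as `parse_relation` does, is what makes the resulting labels usable as distant supervision for a smaller classifier.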

These tasks provide a labeled dataset that can be used to train models specifically for misinformation-mitigation tasks.
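The topical-clustering step can be approximated with embedding similarity. The greedy threshold-based grouping below is a stand-in for whatever clustering SynDy actually uses, and the embeddings are stubbed as plain vectors; in a real pipeline they would come from a sentence encoder:

```python
# Minimal sketch of topical clustering via greedy cosine-similarity grouping.
# The algorithm and threshold are illustrative assumptions, not SynDy's method.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_indices)
    labels = []
    for i, vec in enumerate(vectors):
        for cid, (seed, members) in enumerate(clusters):
            if cosine(vec, seed) >= threshold:
                members.append(i)
                labels.append(cid)
                break
        else:
            clusters.append((vec, [i]))
            labels.append(len(clusters) - 1)
    return labels
```

Each resulting cluster corresponds to a candidate narrative theme that a fact-checker could review as a unit.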

Real-World Challenges Addressed

Neglected Topics in Mainstream Misinformation

Many diaspora communities encounter misinformation that is often ignored by mainstream fact-checking. SynDy aims to bridge this gap by creating datasets for these underrepresented topics, allowing smaller organizations to leverage LLMs without the need for extensive, human-annotated data.

Scaling Misinformation Response Efforts

Traditional fact-checking is usually reactive—addressing false claims as they appear. SynDy helps shift this paradigm by clustering claims into larger narrative themes. This enables a proactive approach, where high-quality explainer content and pre-bunking materials can be prepared in advance.

Experimental Results

The framework was tested on real-world datasets to evaluate its efficacy. The results were promising:

  • Claim Matching: SynDy-trained models showed a 12.5% improvement over a zero-shot baseline in MAP@20.
  • Topical Clustering: Models trained on SynDy-generated labels performed similarly to those trained on human-annotated data, with the Perspectrum dataset seeing even better performance.
  • Claim Relationship Classification: While the models trained on synthetic data lagged slightly behind those trained on human data, they still performed robustly with a Macro-F1 score of 0.863 compared to 0.907.
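The claim-matching result above is reported as MAP@20, the mean of per-query average precision truncated at rank 20. As a reference, here is a standard implementation of the metric assuming binary relevance judgments (this is the generic definition, not the paper's evaluation code):

```python
# Mean Average Precision at cutoff K, the metric behind the MAP@20 result.

def average_precision_at_k(ranked_ids, relevant_ids, k=20):
    """Average precision for one query, truncated at rank k."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at each relevant hit
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision_at_k(queries, k=20):
    """queries: list of (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision_at_k(r, g, k) for r, g in queries) / len(queries)
```

Because average precision rewards placing relevant matches near the top of the ranking, a 12.5% MAP@20 gain means SynDy-trained models surface correct claim matches noticeably earlier than the zero-shot baseline.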

Overall, these results indicate that SynDy can effectively substitute for human-annotated data in many cases, providing a feasible and scalable solution for misinformation mitigation.

Implications and Future Directions

The practical implications of SynDy can be summarized as follows:

  • Cost Efficiency: Generating synthetic data is far less expensive and faster than manual annotation, making large-scale fact-checking efforts more feasible.
  • Broader Coverage: By focusing on neglected topics, SynDy helps bring attention to misinformation affecting smaller, often overlooked communities.
  • Scalable Solutions: Integrating SynDy into platforms like Meedan’s tiplines can significantly expand the capabilities of misinformation-response initiatives.

In the future, SynDy could evolve to include more sophisticated tasks and improve its relationship classification algorithms. More realistic evaluation scenarios and real-time application in misinformation detection pipelines could further validate and refine the framework.

Conclusion

SynDy represents a practical and effective framework for generating synthetic datasets aimed at misinformation mitigation. By leveraging LLMs, the framework can tackle tasks like Claim Matching, Topical Clustering, and Claim Relationship Classification, providing essential tools to scale up human-led fact-checking efforts. The promising experimental results indicate that models trained on SynDy-generated labels can perform comparably to those trained on human-annotated data. This makes SynDy a valuable asset for researchers, journalists, and fact-checkers in the ongoing battle against misinformation.

