SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks (2405.10700v1)
Abstract: Diaspora communities are disproportionately impacted by off-the-radar misinformation and are often neglected by mainstream fact-checking efforts, creating a critical need to scale up the efforts of nascent fact-checking initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic Dataset Generation that leverages the capabilities of the largest frontier LLMs to train local, specialized LLMs. To the best of our knowledge, SynDy is the first work to use LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy uses LLMs and social media queries to automatically generate distantly supervised, topically focused datasets with synthetic labels for these three tasks, providing essential tools to scale up human-led fact-checking at a fraction of the cost of human-annotated data. Training on SynDy's generated labels improves over a standard baseline and is not significantly worse than training on human labels (which may be infeasible to acquire). SynDy is being integrated into Meedan's chatbot tiplines, which are used by over 50 organizations, serve over 230K users annually, and automatically distribute human-written fact-checks via messaging apps such as WhatsApp. SynDy will also be integrated into our deployed Co-Insights toolkit, enabling low-resource organizations to launch tiplines for their communities. Finally, we envision SynDy enabling additional fact-checking tools, such as matching new misinformation claims to high-quality explainers on common misinformation topics.
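The abstract describes SynDy's pipeline only at a high level: a frontier LLM plus social media queries yields distantly supervised, synthetically labeled data on three tasks, which is then used to train a smaller local model. As a rough illustration of that idea, the sketch below shows how one might prompt an LLM to produce synthetic claim-matching labels for a topically focused batch of social media posts. The `query_frontier_llm` stub, the prompt wording, and the JSON label format are assumptions made for illustration and are not taken from the paper.

```python
import json
from typing import Dict, List


def query_frontier_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a large frontier LLM;
    replace with a real API client call in practice."""
    raise NotImplementedError("plug in an LLM client here")


def synthetic_claim_matching_labels(posts: List[str], claim: str) -> List[Dict]:
    """Ask the LLM whether each social media post matches a fact-checked
    claim, yielding distantly supervised (post, claim, label) triples."""
    prompt = (
        "You are annotating data for a claim-matching classifier.\n"
        f"Claim: {claim}\n"
        "For each post below, answer MATCH or NO_MATCH with a one-line reason.\n"
        + "\n".join(f"{i + 1}. {p}" for i, p in enumerate(posts))
        + '\nRespond as a JSON list of objects with fields '
          '"post_index", "label", and "reason".'
    )
    raw = query_frontier_llm(prompt)
    labels = json.loads(raw)  # synthetic labels; filter/validate before training
    return [
        {
            "post": posts[item["post_index"] - 1],
            "claim": claim,
            "label": item["label"],
        }
        for item in labels
    ]
```

The resulting triples could then serve as fine-tuning data for a compact claim-matching model; the same pattern would extend to topical clustering and claim-relationship labels by swapping the prompt and output schema.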