
AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild (2405.11697v2)

Published 19 May 2024 in cs.CY

Abstract: The prevalence and harms of online misinformation is a perennial concern for internet platforms, institutions and society at large. Over time, information shared online has become more media-heavy and misinformation has readily adapted to these new modalities. The rise of generative AI-based tools, which provide widely-accessible methods for synthesizing realistic audio, images, video and human-like text, have amplified these concerns. Despite intense public interest and significant press coverage, quantitative information on the prevalence and modality of media-based misinformation remains scarce. Here, we present the results of a two-year study using human raters to annotate online media-based misinformation, mostly focusing on images, based on claims assessed in a large sample of publicly-accessible fact checks with the ClaimReview markup. We present an image typology, designed to capture aspects of the image and manipulation relevant to the image's role in the misinformation claim. We visualize the distribution of these types over time. We show the rise of generative AI-based content in misinformation claims, and that its commonality is a relatively recent phenomenon, occurring significantly after heavy press coverage. We also show "simple" methods dominated historically, particularly context manipulations, and continued to hold a majority as of the end of data collection in November 2023. The dataset, Annotated Misinformation, Media-Based (AMMeBa), is publicly-available, and we hope that these data will serve as both a means of evaluating mitigation methods in a realistic setting and as a first-of-its-kind census of the types and modalities of online misinformation.

Authors (11)
  1. Nicholas Dufour (3 papers)
  2. Arkanath Pathak (5 papers)
  3. Pouya Samangouei (9 papers)
  4. Nikki Hariri (1 paper)
  5. Shashi Deshetti (1 paper)
  6. Andrew Dudfield (1 paper)
  7. Christopher Guess (1 paper)
  8. Pablo Hernández Escayola (1 paper)
  9. Bobby Tran (1 paper)
  10. Mevan Babakar (2 papers)
  11. Christoph Bregler (7 papers)
Citations (8)

Summary

Overview of Media-Based Misinformation Claims

The paper presents a comprehensive survey of media-based misinformation claims, drawing on publicly available fact checks annotated by human raters. Misinformation is quantified from empirical data collected over a two-year period, focusing predominantly on image-based misinformation while also covering media of other modalities.

Data Collection and Methodology

Misinformation claims were sampled from 135,838 fact checks available online, with annotation performed in several stages to ensure coverage and accuracy. Data were gathered predominantly from English-language fact checks using the ClaimReview markup. The resulting dataset, Annotated Misinformation, Media-Based (AMMeBa), was designed to support both broad and fine-grained analyses of the prevalence and categorization of media-based misinformation. Annotations were produced by a relatively small but carefully trained pool of 83 raters.
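The ClaimReview markup the study samples from is a schema.org type embedded in fact-check pages, typically as JSON-LD. As a rough illustration of the kind of record involved, the sketch below parses a hypothetical ClaimReview object; the sample values are invented, while the field names follow the schema.org ClaimReview type:

```python
import json

# Hypothetical JSON-LD record of the sort a fact-check page embeds.
# Values are invented; field names follow schema.org's ClaimReview type.
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "ClaimReview",
  "datePublished": "2023-06-01",
  "url": "https://example.org/fact-check/123",
  "claimReviewed": "Photo shows event X in location Y.",
  "reviewRating": {"@type": "Rating", "alternateName": "False"}
}
"""

def parse_claim_review(jsonld_text):
    """Return the fields most relevant to sampling claims for annotation."""
    data = json.loads(jsonld_text)
    if data.get("@type") != "ClaimReview":
        return None  # not a ClaimReview record
    return {
        "claim": data.get("claimReviewed"),
        "rating": (data.get("reviewRating") or {}).get("alternateName"),
        "date": data.get("datePublished"),
        "url": data.get("url"),
    }

record = parse_claim_review(SAMPLE_JSONLD)
print(record["claim"])   # the claim text being fact-checked
print(record["rating"])  # e.g. "False"
```

In practice such records are extracted from `<script type="application/ld+json">` tags on fact-check pages; the sketch assumes the JSON-LD has already been isolated.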

Trends in Media-Based Misinformation

Approximately 80% of misinformation claims involved media. Video-based misinformation has grown increasingly prevalent, particularly from 2022 onwards, correlating with the rising popularity of video-sharing platforms. While images historically dominated, the shift towards video does not imply an outright decline in other modalities.

Image Classification and Typology

The paper introduces a typology that categorizes images into "basic" images, "complex" images, and specific subcategories such as "screenshots" and "analog gap" images (media re-captured through an analog channel, for example a photograph of a screen). Basic images lack additional graphical elements, while complex images include elements such as text overlays or multiple sub-images.

Manipulation Types

Image-based misinformation was further classified into content manipulations, context manipulations, and text-based images:

  • Content Manipulations: Images whose pixels have been altered or synthesized, including AI-generated images and text modifications.
  • Context Manipulations: Authentic images presented with misleading context (e.g., time, location, identity).
  • Text-Based Images: Misinformation rendered as overlaid text that articulates the false claim.
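The typology and manipulation categories above can be sketched as simple Python types. The enum members mirror the categories described in the summary; the class and field names are illustrative, not the dataset's actual schema:

```python
from enum import Enum
from dataclasses import dataclass

class ImageType(Enum):
    BASIC = "basic"            # no added graphical elements
    COMPLEX = "complex"        # text overlays, multiple sub-images, etc.
    SCREENSHOT = "screenshot"  # capture of another interface
    ANALOG_GAP = "analog_gap"  # re-captured, e.g. a photograph of a screen

class ManipulationType(Enum):
    CONTENT = "content"        # altered or synthesized pixels, incl. AI generation
    CONTEXT = "context"        # authentic image, misleading framing
    TEXT_BASED = "text_based"  # false claim rendered as overlaid text

@dataclass
class ImageAnnotation:
    """One annotated image; field names are hypothetical."""
    image_type: ImageType
    manipulation: ManipulationType
    has_text: bool

# Example: an unaltered photo presented with a misleading caption.
ann = ImageAnnotation(ImageType.BASIC, ManipulationType.CONTEXT, has_text=False)
print(ann.manipulation.value)  # "context"
```

Encoding the taxonomy as enums makes downstream tallies (e.g., the prevalence trends reported below) straightforward group-by operations over annotation records.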

Key Findings

  1. Context Manipulation Dominance: Contrary to the popular focus on AI-generated content, simple manipulation methods remain prevalent. Context manipulations, which require minimal technological sophistication, were far more common than manipulations created with sophisticated tools such as generative AI.
  2. AI-Generated Content: AI-generated images appeared significantly only after Spring 2023. Despite increasing media attention and frequent mentions in the press, AI-generated content is still outnumbered by simpler manipulation methods.
  3. Text Use in Images: Text was identified as a substantial element in misinformation images, with approximately 80% of them bearing text content. Misinformation claims often depended materially on this text, especially in cases featuring self-contextualizing images.
  4. Screenshots: Screenshots, particularly from social media, are prevalent. Non-social media screenshots often simulate official communications, aiming to leverage the implied authority of such formats.
  5. Reverse Image Search: Provenance recovery through reverse image search was a common fact-checking technique. The recovery of pre-manipulation originals declined notably, likely because AI-generated images, which have no original to recover, became more common.

Implications and Speculation

This paper highlights the need for adaptive strategies to mitigate misinformation. The low cost and simplicity of context manipulations underscore the need for solutions that go beyond sophisticated forensic techniques targeting content manipulations.

Future AI systems should prioritize contextual and provenance information to verify the authenticity of images, capitalizing on advances in machine learning and natural language processing. Moreover, the analyses underscore how user-generated content and changes in content consumption (video proliferation) impact misinformation modalities.

Future Research Directions

  1. Broader Language Scope: Expanding the study to include non-English fact checks would give a more global view of misinformation trends.
  2. Video and Audio Content: Deeper exploration of video-based misinformation, given its increasing prevalence, is paramount.
  3. Continued Monitoring: The rapid adoption of generative AI necessitates ongoing monitoring to capture emerging trends.
  4. Finer-Grained Categorization: Further categorization of context manipulations and other nuanced subtypes could provide detailed insights for developing mitigation tools.

Conclusion

The paper illuminates the pervasive and evolving nature of media-based misinformation. It provides essential data for researchers and developers aiming to build robust, scalable systems to counter misinformation. While generative AI has added complexity to the misinformation landscape, simple context manipulations retain significant influence, indicating both enduring techniques and novel challenges for misinformation mitigation.

The AMMeBa dataset promises to be an invaluable resource for future computational studies, allowing for the empirical grounding of mitigation methods and providing a clearer lens through which to observe and address the multifaceted nature of online misinformation.

