
AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild (2405.11697v2)

Published 19 May 2024 in cs.CY

Abstract: The prevalence and harms of online misinformation is a perennial concern for internet platforms, institutions and society at large. Over time, information shared online has become more media-heavy and misinformation has readily adapted to these new modalities. The rise of generative AI-based tools, which provide widely-accessible methods for synthesizing realistic audio, images, video and human-like text, have amplified these concerns. Despite intense public interest and significant press coverage, quantitative information on the prevalence and modality of media-based misinformation remains scarce. Here, we present the results of a two-year study using human raters to annotate online media-based misinformation, mostly focusing on images, based on claims assessed in a large sample of publicly-accessible fact checks with the ClaimReview markup. We present an image typology, designed to capture aspects of the image and manipulation relevant to the image's role in the misinformation claim. We visualize the distribution of these types over time. We show the rise of generative AI-based content in misinformation claims, and that its commonality is a relatively recent phenomenon, occurring significantly after heavy press coverage. We also show "simple" methods dominated historically, particularly context manipulations, and continued to hold a majority as of the end of data collection in November 2023. The dataset, Annotated Misinformation, Media-Based (AMMeBa), is publicly-available, and we hope that these data will serve as both a means of evaluating mitigation methods in a realistic setting and as a first-of-its-kind census of the types and modalities of online misinformation.

Authors (11)
  1. Nicholas Dufour (3 papers)
  2. Arkanath Pathak (5 papers)
  3. Pouya Samangouei (9 papers)
  4. Nikki Hariri (1 paper)
  5. Shashi Deshetti (1 paper)
  6. Andrew Dudfield (1 paper)
  7. Christopher Guess (1 paper)
  8. Pablo Hernández Escayola (1 paper)
  9. Bobby Tran (1 paper)
  10. Mevan Babakar (2 papers)
  11. Christoph Bregler (7 papers)
Citations (8)

Summary

Overview of Media-Based Misinformation Claims

The paper presents a comprehensive survey of media-based misinformation claims, drawing on publicly available fact checks annotated by human raters. Misinformation is quantified from empirical data collected over a two-year period, focusing predominantly on image-based misinformation while also covering media of other modalities.

Data Collection and Methodology

Misinformation claims were sampled from 135,838 fact checks available online, with annotation performed in several stages to ensure coverage and accuracy. Data were gathered predominantly from English-language fact checks using the ClaimReview markup. The resulting dataset, Annotated Misinformation, Media-Based (AMMeBa), was designed to support both broad and fine-grained analyses of the prevalence and categorization of media-based misinformation. Annotations were produced by a relatively small but carefully trained pool of 83 raters.
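The ClaimReview markup the study samples from is a schema.org type embedded in fact-check pages, typically as JSON-LD. As a rough illustration of the kind of record involved, the sketch below parses a hypothetical ClaimReview object; the sample values are invented, while the field names follow the schema.org ClaimReview type:

```python
import json

# Hypothetical JSON-LD record of the sort a fact-check page embeds.
# Values are invented; field names follow schema.org's ClaimReview type.
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "ClaimReview",
  "datePublished": "2023-06-01",
  "url": "https://example.org/fact-check/123",
  "claimReviewed": "Photo shows event X in location Y.",
  "reviewRating": {"@type": "Rating", "alternateName": "False"}
}
"""

def parse_claim_review(jsonld_text):
    """Return the fields most relevant to sampling claims for annotation."""
    data = json.loads(jsonld_text)
    if data.get("@type") != "ClaimReview":
        return None  # not a ClaimReview record
    return {
        "claim": data.get("claimReviewed"),
        "rating": (data.get("reviewRating") or {}).get("alternateName"),
        "date": data.get("datePublished"),
        "url": data.get("url"),
    }

record = parse_claim_review(SAMPLE_JSONLD)
print(record["claim"])   # the claim text being fact-checked
print(record["rating"])  # e.g. "False"
```

In practice such records are extracted from `<script type="application/ld+json">` tags on fact-check pages; the sketch assumes the JSON-LD has already been isolated.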

Trends in Media-Based Misinformation

Approximately 80% of misinformation claims involved media. Video-based misinformation has grown increasingly prevalent, particularly from 2022 onwards, correlating with the rising popularity of video-sharing platforms. While images historically dominated, the shift towards video does not imply an outright decline in other modalities.

Image Classification and Typology

The paper introduces a typology that categorizes images into "basic" images, "complex" images, and specific subcategories such as "screenshots" and "analog gap" images (media re-captured through an analog channel, for example a photograph of a screen). Basic images lack additional graphical elements, while complex images include elements such as text overlays or multiple sub-images.

Manipulation Types

Image-based misinformation was further classified into content manipulations, context manipulations, and text-based images:

  • Content Manipulations: Images whose pixels have been altered or synthesized, including AI-generated images and text modifications.
  • Context Manipulations: Authentic images presented with misleading context (e.g., time, location, identity).
  • Text-Based Images: Misinformation rendered as overlaid text that articulates the false claim.
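The typology and manipulation categories above can be sketched as simple Python types. The enum members mirror the categories described in the summary; the class and field names are illustrative, not the dataset's actual schema:

```python
from enum import Enum
from dataclasses import dataclass

class ImageType(Enum):
    BASIC = "basic"            # no added graphical elements
    COMPLEX = "complex"        # text overlays, multiple sub-images, etc.
    SCREENSHOT = "screenshot"  # capture of another interface
    ANALOG_GAP = "analog_gap"  # re-captured, e.g. a photograph of a screen

class ManipulationType(Enum):
    CONTENT = "content"        # altered or synthesized pixels, incl. AI generation
    CONTEXT = "context"        # authentic image, misleading framing
    TEXT_BASED = "text_based"  # false claim rendered as overlaid text

@dataclass
class ImageAnnotation:
    """One annotated image; field names are hypothetical."""
    image_type: ImageType
    manipulation: ManipulationType
    has_text: bool

# Example: an unaltered photo presented with a misleading caption.
ann = ImageAnnotation(ImageType.BASIC, ManipulationType.CONTEXT, has_text=False)
print(ann.manipulation.value)  # "context"
```

Encoding the taxonomy as enums makes downstream tallies (e.g., the prevalence trends reported below) straightforward group-by operations over annotation records.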

Key Findings

  1. Context Manipulation Dominance: Contrary to the popular focus on AI-generated content, simple manipulation methods remain prevalent. Context manipulations, which require minimal technological sophistication, were far more common than manipulations created with sophisticated tools such as generative AI.
  2. AI-Generated Content: AI-generated images appeared significantly only after Spring 2023. Despite increasing media attention and frequent mentions in the press, AI-generated content is still outnumbered by simpler manipulation methods.
  3. Text Use in Images: Text was identified as a substantial element in misinformation images, with approximately 80% of them bearing text content. Misinformation claims often depended materially on this text, especially in cases featuring self-contextualizing images.
  4. Screenshots: Screenshots, particularly from social media, are prevalent. Non-social media screenshots often simulate official communications, aiming to leverage the implied authority of such formats.
  5. Reverse Image Search: Provenance recovery through reverse image search was a common fact-checking technique. The recovery of pre-manipulation originals declined notably, likely because AI-generated images, which have no original to recover, became more common.

Implications and Speculation

This paper highlights the need for adaptive strategies to mitigate misinformation. The low cost and simplicity of context manipulations underscore the need for solutions that go beyond sophisticated forensic techniques targeting content manipulations.

Future AI systems should prioritize contextual and provenance information to verify the authenticity of images, capitalizing on advances in machine learning and natural language processing. Moreover, the analyses underscore how user-generated content and changes in content consumption (video proliferation) impact misinformation modalities.

Future Research Directions

  1. Broader Language Scope: Expanding the study to include non-English fact checks would give a more global view of misinformation trends.
  2. Video and Audio Content: Deeper exploration of video-based misinformation, given its increasing prevalence, is paramount.
  3. Continued Monitoring: The rapid adoption of generative AI necessitates ongoing monitoring to capture emerging trends.
  4. Finer-Grained Categorization: Further categorization of context manipulations and other nuanced subtypes could provide detailed insights for developing mitigation tools.

Conclusion

The paper illuminates the pervasive and evolving nature of media-based misinformation. It provides essential data for researchers and developers aiming to build robust, scalable systems to counter misinformation. While generative AI has added complexity to the misinformation landscape, simple context manipulations retain significant influence, indicating both enduring techniques and novel challenges for misinformation mitigation.

The AMMeBa dataset promises to be an invaluable resource for future computational studies, allowing for the empirical grounding of mitigation methods and providing a clearer lens through which to observe and address the multifaceted nature of online misinformation.

