
Human Detection of Political Speech Deepfakes across Transcripts, Audio, and Video (2202.12883v4)

Published 25 Feb 2022 in cs.HC and cs.AI

Abstract: Recent advances in technology for hyper-realistic visual and audio effects provoke the concern that deepfake videos of political speeches will soon be indistinguishable from authentic video recordings. The conventional wisdom in communication theory predicts people will fall for fake news more often when the same version of a story is presented as a video versus text. We conduct 5 pre-registered randomized experiments with 2,215 participants to evaluate how accurately humans distinguish real political speeches from fabrications across base rates of misinformation, audio sources, question framings, and media modalities. We find base rates of misinformation minimally influence discernment and deepfakes with audio produced by the state-of-the-art text-to-speech algorithms are harder to discern than the same deepfakes with voice actor audio. Moreover across all experiments, we find audio and visual information enables more accurate discernment than text alone: human discernment relies more on how something is said, the audio-visual cues, than what is said, the speech content.

Overview of "Human Detection of Political Speech Deepfakes across Transcripts, Audio, and Video"

The paper "Human Detection of Political Speech Deepfakes across Transcripts, Audio, and Video" presents an empirical exploration of human discernment in detecting fabricated political content across various media forms. The goal is to scrutinize the human ability to distinguish between real and deepfake political speeches presented as text, audio, or video.

The crux of the paper is a set of five pre-registered randomized experiments involving 2,215 participants. The experiments were designed to investigate how media modality, audio source, and the base rate of misinformation influence participants' capacity to discern real political speeches from fabrications.

Key Findings

  1. Modality Influence: Participants distinguish real from deepfake political speeches more accurately when given richer media modalities; visual and auditory information significantly improve accuracy over text alone. In Experiment 1a, accuracy rose from 57% with text transcripts to 82% with combined audio-visual presentation.
  2. Effect of Audio Source: In Experiment 2, deepfakes with text-to-speech (TTS) audio proved harder for participants to identify as fabrications than the same deepfakes with voice actor audio. Accuracy on the TTS-enhanced deepfakes dropped to near chance, highlighting how effectively modern audio synthesis deceives human perception.
  3. Base Rate of Misinformation: The experiments also varied the proportion of fakes participants encountered. Higher base rates of misinformation slightly altered participants' responses, but discernment accuracy remained consistently higher when both video and audio cues were available, especially for correctly identifying fabrications (a simulation sketch of this base-rate insensitivity follows the list).
  4. Subtle Primes in Unsuspecting Viewing Conditions: Experiment 5 showed that even when participants are not explicitly prompted to question authenticity, exposure to realistic deepfake examples can cultivate suspicion, suggesting that subtle cues can trigger discernment under real-world viewing conditions.
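
To make the base-rate point in finding 3 concrete, here is a minimal Python simulation. It is not the paper's analysis code, and the per-class accuracy figures are hypothetical; it only illustrates that when accuracy on real clips and on fake clips is similar, overall accuracy is their weighted mean and shifts only modestly as the proportion of fakes changes.

```python
import random

# Hypothetical per-class accuracies (not figures from the paper).
P_CORRECT_REAL = 0.80  # assumed accuracy on authentic clips
P_CORRECT_FAKE = 0.75  # assumed accuracy on fabricated clips

def simulate_accuracy(base_rate_fake: float, n_trials: int = 100_000) -> float:
    """Return overall accuracy when `base_rate_fake` of the clips are fabricated."""
    correct = 0
    for _ in range(n_trials):
        is_fake = random.random() < base_rate_fake
        p_correct = P_CORRECT_FAKE if is_fake else P_CORRECT_REAL
        correct += random.random() < p_correct  # True counts as 1
    return correct / n_trials

for base_rate in (0.2, 0.5, 0.8):
    print(f"base rate of fakes {base_rate:.0%}: overall accuracy ~ {simulate_accuracy(base_rate):.1%}")
```

Because the overall score only interpolates between the two per-class rates, large changes in the base rate move it by just a few percentage points, consistent with the paper's finding that base rates minimally influence discernment.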

Implications

The findings carry significant implications for multimedia and misinformation research. They shed light on the complexities of human-media interaction, suggesting that while richer multimedia can bolster discernment accuracy, it simultaneously raises the difficulty of detecting highly convincing fakes. This nuanced understanding provides a precautionary lens through which future AI developments in media synthesis should be examined.

From a theoretical standpoint, the research challenges the simplistic notion that seeing is always believing. Instead, it argues for a more intricate interplay between media modality, perceptual cues, and prior expectations in shaping human belief and discernment.

Practical and Theoretical Speculations

Practically, these observations suggest that multimedia platforms might need sophisticated moderation tools that disaggregate modality components, enabling users to assess authenticity with explicit cues regarding which elements of a media piece might be manipulated. Theoretically, it opens avenues for deeper examination into cognitive biases and capabilities concerning audiovisual synthesis, emphasizing the need for interdisciplinary approaches in tackling misinformation.
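
As an illustration of what "disaggregating modality components" might look like in practice, here is a hedged Python sketch of a per-modality authenticity record. All class names, fields, and the threshold are hypothetical, not drawn from the paper or any existing platform API.

```python
from dataclasses import dataclass

@dataclass
class ModalityAssessment:
    modality: str              # "transcript", "audio", or "video"
    manipulation_score: float  # 0.0 = likely authentic, 1.0 = likely manipulated
    detector: str              # which detection model produced the score

@dataclass
class MediaAuthenticityReport:
    media_id: str
    assessments: list[ModalityAssessment]

    def flagged_modalities(self, threshold: float = 0.5) -> list[str]:
        """List the modalities whose manipulation score exceeds the threshold,
        so a user sees which element of the media may be manipulated rather
        than a single real/fake verdict for the whole post."""
        return [a.modality for a in self.assessments
                if a.manipulation_score > threshold]
```

The design choice here mirrors the paper's finding that discernment is modality-dependent: surfacing a separate cue per modality (e.g., "the audio track scores high for manipulation") gives viewers the kind of information they appear to rely on most.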

Future research could expand this paper's scope by examining the implications of hyper-realistic deepfakes for social dynamics and trust in digital environments. As AI techniques evolve, understanding the thresholds and biases of human perception will be crucial for anticipating changes in the media ecosystem and establishing robust countermeasures against digital deception.

In summary, this paper provides a data-driven exploration into the nuances of human discernment of deepfake media, highlighting both the technological sophistication of current AI applications and the enduring complexities of human perception. The ongoing interplay between these factors will likely continue to shape the landscape of political communication and digital trust.

Authors (6)
  1. Matthew Groh
  2. Aruna Sankaranarayanan
  3. Nikhil Singh
  4. Dong Young Kim
  5. Andrew Lippman
  6. Rosalind Picard