Targeted Adversarial Attacks on Automatic Speech Recognition Systems
The paper, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" by Nicholas Carlini and David Wagner presents a critical exploration into adversarial examples within the domain of automatic speech recognition (ASR). The authors devise methods to create targeted adversarial examples for ASR systems and demonstrate their attacks on Mozilla's implementation of DeepSpeech.
Key Contributions and Findings
- Adversarial Attack Methodology: The primary contribution is the formulation of targeted audio adversarial attacks that can cause any input waveform to be transcribed as any chosen phrase. By iteratively optimizing the adversarial perturbation, the authors achieve a 100% attack success rate while keeping the mean distortion at -31 dB relative to the original signal, making the perturbations nearly inaudible to the human ear (a minimal sketch of the optimization loop appears after this list).
- Evaluating DeepSpeech: By conducting white-box attacks on DeepSpeech, the authors highlight the vulnerability of a state-of-the-art ASR model. Forcing the network to emit up to 50 characters per second of audio, the theoretical maximum of its output rate, demonstrates the precision with which adversarial perturbations can manipulate ASR outputs.
- Refinement of the Loss Function: An improved, logit-based loss function further reduces the required perturbation. By concentrating the optimization on only the parts of the audio waveform that are not yet decoded correctly, the authors lower the mean distortion to -38 dB, although this improvement applies primarily to greedy decoding rather than beam-search decoding (a sketch of this per-frame loss also follows the list).
- Evaluation on Non-Speech: The authors extend their technique to non-speech audio such as classical music clips, demonstrating that arbitrary sounds can also be manipulated to produce targeted transcriptions. This highlights the generality of the method across different types of audio input.
- Targeting Silence: The research additionally presents an attack in which adversarial noise causes an audio clip containing speech to be transcribed as silence, i.e., the empty phrase. This broadens the discussion of how adversarial attacks apply across different ASR scenarios.
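The core attack can be summarized as gradient descent on the perturbation with an ever-tightening loudness bound. The following is a minimal PyTorch-style sketch, not the authors' original TensorFlow implementation: `model` (returning per-frame logits), `transcribe` (running the decoder to a string), the blank index, the hyperparameters, and the bound-shrinking schedule are all placeholders chosen for illustration.

```python
import torch
import torch.nn.functional as F

def db(x):
    # Loudness in decibels, as in the paper: dB(x) = 20 * log10(max_i |x_i|);
    # the reported distortion is dB_x(delta) = dB(delta) - dB(x).
    return 20 * torch.log10(x.abs().max())

def targeted_attack(model, transcribe, x, target_ids, target_text,
                    tau=2000.0, iters=5000, lr=10.0, c=1.0):
    """Find a quiet perturbation delta such that model(x + delta) transcribes
    as target_text.  x is a 1-D float tensor of 16-bit sample values."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    ctc = torch.nn.CTCLoss(blank=0)          # blank index depends on the model
    best = None

    for _ in range(iters):
        adv = x + delta
        logits = model(adv)                  # (T, num_labels) per-frame logits
        log_probs = F.log_softmax(logits, dim=-1).unsqueeze(1)   # (T, 1, C)
        loss = c * ctc(log_probs,
                       target_ids.unsqueeze(0),
                       torch.tensor([logits.shape[0]]),
                       torch.tensor([target_ids.numel()]))
        opt.zero_grad()
        loss.backward()
        opt.step()

        with torch.no_grad():
            # Project delta back into the l_inf ball of radius tau,
            # which keeps the distortion bounded.
            delta.clamp_(-tau, tau)

        if transcribe(model, x + delta) == target_text:
            # Attack succeeded: remember delta and tighten the bound to
            # keep searching for an even quieter perturbation.
            best = delta.detach().clone()
            print("success at distortion", (db(best) - db(x)).item(), "dB")
            tau = 0.9 * best.abs().max().item()
    return best
```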
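For the refined attack, the CTC loss is replaced by a per-frame, Carlini-Wagner-style hinge loss computed against a fixed target alignment (for example, one taken from a successful CTC-based attack). A frame that already predicts its target label contributes zero loss, so gradient updates concentrate on the portions of the waveform that still need to change. A hedged sketch, with the paper's per-frame constants left as an optional argument:

```python
import torch

def frame_hinge_loss(frame_logits, target_label, kappa=0.0):
    # CW-style hinge on one frame: push the target label's logit above every
    # competing logit (by at least kappa); zero once it already is.
    competitors = frame_logits.clone()
    competitors[target_label] = float('-inf')
    return torch.clamp(competitors.max() - frame_logits[target_label] + kappa,
                       min=0.0)

def alignment_loss(logits, alignment, weights=None):
    """logits: (T, C) per-frame logits from the model.
    alignment: length-T list of target labels (including CTC blanks), e.g.
    taken from the decoding of an adversarial example found with CTC loss.
    weights: optional per-frame constants, analogous to the paper's c_i."""
    total = logits.new_zeros(())
    for i, pi in enumerate(alignment):
        l = frame_hinge_loss(logits[i], pi)
        total = total + (weights[i] * l if weights is not None else l)
    return total
```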
Theoretical and Practical Implications
Theoretical Implications
- Understanding ASR Vulnerabilities: The paper expands the understanding of neural network vulnerabilities beyond visual data to audio data, solidifying the notion that ASRs are not immune to adversarial perturbations.
- Loss Function Optimization: The advancement in loss function design, moving from CTC loss to a more targeted logit-based loss, sheds light on the importance of loss function selection and its impact on the efficacy of adversarial examples.
- Local Linearity Hypothesis: The experiments contrast the iterative attack with the single-step Fast Gradient Sign Method (FGSM), providing evidence that single-step, linear methods are insufficient on their own for crafting targeted audio adversarial examples, due in part to the non-linearity of the MFCC pre-processing and the recurrent nature of the ASR model (a minimal single-step sketch follows this list for contrast).
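For contrast, a targeted single-step FGSM update in the audio domain looks roughly like the sketch below, using the same placeholder `model` and CTC setup as above. The paper reports that such a single step does not by itself yield the target transcription, which motivates the iterative attack.

```python
import torch
import torch.nn.functional as F

def fgsm_targeted(model, x, target_ids, eps):
    """One targeted FGSM step: move every sample by +/- eps in the direction
    that decreases the CTC loss toward the target transcription."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)                                        # (T, C)
    log_probs = F.log_softmax(logits, dim=-1).unsqueeze(1)   # (T, 1, C)
    ctc = torch.nn.CTCLoss(blank=0)
    loss = ctc(log_probs, target_ids.unsqueeze(0),
               torch.tensor([logits.shape[0]]),
               torch.tensor([target_ids.numel()]))
    loss.backward()
    # Subtract the gradient sign because we *minimize* the targeted loss.
    return (x - eps * x.grad.sign()).detach()
```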
Practical Implications
- Security Concerns: The practical implications for real-world systems are significant. ASR systems deployed in consumer devices (e.g., Siri, Google Assistant) could in principle be manipulated into transcribing attacker-chosen phrases, although the attacks demonstrated here assume white-box access and feed audio directly to the model rather than playing it over the air.
- Robustness and Defense Mechanisms: The findings challenge developers to harden ASR systems against such adversarial attacks. Potential defenses must account for the characteristics of audio data, which differ from those of images.
- Future Applications and Research: The research opens pathways for further explorations into adversarial robustness, both through advancing existing defenses and understanding how these attacks can be adapted or countered in more complex, real-world scenarios such as over-the-air attacks or noisy environments.
Open Questions and Future Work
- Physical-World Robustness: The next step is to make the adversarial attacks effective over the air. Given that prior work on images has successfully demonstrated physical-world attacks, similar advances in the audio domain would significantly increase the practical impact of this research.
- Universal Perturbations and Transferability: The existence of universal adversarial perturbations, as well as the transferability of these attacks across different ASR models, remains an open area of investigation. Either property would drastically amplify the risks posed by adversarial attacks.
- Comprehensive Defense Evaluation: Defensive techniques validated in the image domain need to be rigorously tested against audio adversarial examples to ensure holistic security in ASR systems.
Final Remarks
The paper "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" by Carlini and Wagner provides substantial insights and methodologies related to the vulnerability of ASR systems to adversarial attacks. The implications extend beyond theoretical interest, posing genuine challenges and opportunities for enhancing security and robustness in speech recognition technologies. It is imperative that ongoing research builds on these foundations to develop resilient ASR systems given their pervasive applications in modern technology.