Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding (1808.05665v2)

Published 16 Aug 2018 in cs.CR, cs.SD, and eess.AS

Abstract: Voice interfaces are becoming widely accepted as input methods for a diverse set of devices. This development is driven by rapid improvements in automatic speech recognition (ASR), which now performs on par with human listening in many tasks. These improvements are based on an ongoing evolution of DNNs as the computational core of ASR. However, recent research results show that DNNs are vulnerable to adversarial perturbations, which allow attackers to force the transcription into a malicious output. In this paper, we introduce a new type of adversarial examples based on psychoacoustic hiding. Our attack exploits the characteristics of DNN-based ASR systems, where we extend the original analysis procedure by an additional backpropagation step. We use this backpropagation to learn the degrees of freedom for the adversarial perturbation of the input signal, i.e., we apply a psychoacoustic model and manipulate the acoustic signal below the thresholds of human perception. To further minimize the perceptibility of the perturbations, we use forced alignment to find the best-fitting temporal alignment between the original audio sample and the malicious target transcription. These extensions allow us to embed an arbitrary audio input with a malicious voice command that is then transcribed by the ASR system, while the audio signal remains barely distinguishable from the original. In an experimental evaluation, we attack the state-of-the-art speech recognition system Kaldi and determine the best-performing parameter and analysis setup for different types of input. Our results show that we are successful in up to 98% of cases, with a computational effort of less than two minutes for a ten-second audio file. Based on user studies, we found that none of our target transcriptions were audible to human listeners, who still understood the original speech content with unchanged accuracy.

Adversarial Attacks on ASR via Psychoacoustic Hiding

The paper discusses a novel approach to adversarial attacks on Automatic Speech Recognition (ASR) systems, focusing on deep neural network (DNN) vulnerabilities. The researchers exploit psychoacoustic models to hide adversarial perturbations beneath human auditory thresholds. Their method leverages backpropagation to subtly modify audio inputs, ensuring that DNN-driven ASR systems produce malicious transcriptions while leaving audio signals nearly indistinguishable from the original.
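
The overall procedure can be illustrated with a short sketch. The code below is PyTorch-style pseudocode under simplifying assumptions, not the authors' implementation; `preprocess`, `acoustic_model`, and `allowed_perturbation` are hypothetical placeholders. The point it demonstrates is that, once feature extraction is expressed as a differentiable operation, the gradient of a recognition loss can be propagated back to the raw audio samples, and the perturbation can be updated iteratively while being held inside a bound supplied by a psychoacoustic model.

```python
# Minimal sketch of the attack loop, assuming a differentiable feature
# pipeline and an acoustic model with per-frame state outputs. All names
# here (preprocess, acoustic_model, allowed_perturbation) are placeholders,
# not the authors' code.
import torch

def craft_adversarial(audio, target_states, preprocess, acoustic_model,
                      allowed_perturbation, steps=500, lr=0.05):
    """audio: raw waveform tensor; target_states: per-frame target state
    labels (e.g. obtained via forced alignment); allowed_perturbation:
    per-sample bound derived from a psychoacoustic masking model."""
    delta = torch.zeros_like(audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(steps):
        features = preprocess(audio + delta)     # differentiable preprocessing
        logits = acoustic_model(features)        # shape: (frames, states)
        loss = loss_fn(logits, target_states)    # drive towards the target
        optimizer.zero_grad()
        loss.backward()                          # gradient reaches raw samples
        optimizer.step()
        with torch.no_grad():                    # keep perturbation bounded
            delta.copy_(torch.clamp(delta, -allowed_perturbation,
                                    allowed_perturbation))
    return (audio + delta).detach()
```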

Key Contributions

  1. Psychoacoustic Hiding Approach: By utilizing psychoacoustic masking models, the attack keeps adversarial perturbations below the thresholds of human perception, minimizing audible distortion. This is critical because it makes the adversarial examples far harder to detect in practical settings (a sketch of the masking constraint follows this list).
  2. Integration With Preprocessing: The preprocessing stage of ASR, responsible for feature extraction, is integrated into the DNN so that gradients can be backpropagated through it. This allows the raw audio to be modified directly during adversarial example generation, which is more efficient and less complex than indirect approaches.
  3. Forced Alignment: The attack incorporates forced alignment to find the best temporal fit between the original audio and the malicious target transcription, exploiting the time-dependent nature of audio data (a toy alignment sketch also follows this list).
  4. Evaluation Against Kaldi ASR: The method was tested against Kaldi, a state-of-the-art DNN-HMM ASR system. The attack generated adversarial samples that were transcribed as the desired malicious output in up to 98% of cases while introducing minimal audible noise.
  5. User Study Validation: A two-part user study demonstrated that human listeners could not discern the embedded adversarial transcriptions, while the original speech remained clearly intelligible, indicating the efficacy of psychoacoustic hiding in practical scenarios.
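
One plausible way to realize the "below the hearing threshold" constraint from contributions 1 and 2 is sketched below: the perturbation is analyzed in the time-frequency domain, and each bin is scaled so that it stays under a masking threshold estimated from the original signal (for instance, with an MPEG-1-style psychoacoustic model). The function and its `threshold_db` argument are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of constraining a perturbation below per-bin masking thresholds.
# threshold_db stands in for the output of a psychoacoustic model applied
# to the original audio and must be broadcastable to the STFT shape.
import numpy as np
from scipy.signal import stft, istft

def limit_below_threshold(perturbation, threshold_db, fs=16000, nperseg=512):
    """Scale the perturbation so its magnitude in every time-frequency bin
    stays below the masking threshold (in dB) estimated from the original."""
    f, t, spec = stft(perturbation, fs=fs, nperseg=nperseg)
    threshold_lin = 10.0 ** (threshold_db / 20.0)   # dB -> linear magnitude
    magnitude = np.abs(spec)
    # shrink only the bins whose magnitude exceeds the threshold
    scale = np.minimum(1.0, threshold_lin / np.maximum(magnitude, 1e-12))
    _, limited = istft(spec * scale, fs=fs, nperseg=nperseg)
    return limited[:len(perturbation)]
```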

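Contribution 3 can likewise be illustrated with a toy dynamic-programming aligner. Given per-frame log-probabilities restricted to the states of the malicious target transcription, it finds the best monotonic frame-to-state assignment; the resulting per-frame labels are what an optimization loop like the one above would be driven towards. It assumes a simple left-to-right model without skips and is purely illustrative.

```python
# Toy forced-alignment sketch: align T audio frames to a fixed sequence of
# S target states (left-to-right, no skips) by dynamic programming.
# log_probs[t, s] is the acoustic model's log-probability of state s at
# frame t, restricted to the states of the target transcription (T >= S).
import numpy as np

def forced_alignment(log_probs):
    T, S = log_probs.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            advance = score[t - 1, s - 1] if s > 0 else -np.inf
            if advance > stay:
                score[t, s], back[t, s] = advance + log_probs[t, s], s - 1
            else:
                score[t, s], back[t, s] = stay + log_probs[t, s], s
    # backtrack from the final state to recover one state label per frame
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return np.array(path[::-1])   # per-frame state indices
```
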
Implications and Future Directions

From a theoretical standpoint, this research highlights significant vulnerabilities in DNN-based ASR systems when exposed to carefully crafted inputs that exploit perceptual limitations. The incorporation of psychoacoustic models marks a notable advance in the subtlety of adversarial techniques, posing challenges to the conventional defenses employed in machine learning systems.

Practically, the research suggests that ASR developers should integrate perceptual models into their training and evaluation processes and adopt more robust defense strategies that account for such perceptually hidden attacks. Future research may explore extending perceptual masking models to other sensory domains, such as visual or tactile data, widening the spectrum of adversarial techniques.

This work opens avenues for creating adversarial attacks that account for human perceptual weaknesses, pressing for adaptive, perceptually aware defenses. Understanding the delicate balance between human perception and machine vulnerabilities remains crucial in securing ASR and related AI systems against such sophisticated attack vectors.

Authors (5)
  1. Lea Schönherr (23 papers)
  2. Katharina Kohls (5 papers)
  3. Steffen Zeiler (8 papers)
  4. Thorsten Holz (52 papers)
  5. Dorothea Kolossa (33 papers)
Citations (275)