Targeted Audio Adversarial Attacks
- Targeted audio adversarial attacks are precise perturbations engineered to force a specific output from audio models while remaining nearly imperceptible to human listeners.
- They employ diverse methodologies including gradient-based white-box, query-efficient black-box, and universal perturbations to achieve high success rates on ASR, speaker recognition, and multimodal systems.
- These attacks highlight significant security risks in audio processing, prompting research into robust defenses such as anomaly detection, adversarial training, and input transformation techniques.
Targeted audio adversarial attacks are algorithmically crafted acoustic perturbations designed to mislead deep learning-based audio systems so that they produce a specific, attacker-chosen output. Unlike untargeted attacks, which only seek any erroneous behavior, targeted attacks force a model—such as an ASR, speaker recognizer, sound event detector, or audio-LLM—to emit a pre-defined transcription, identification, or response, even when the adversarial audio is nearly indistinguishable from the original signal to human listeners. The study of such attacks spans a broad methodological spectrum: gradient-based white-box attacks, query-efficient black-box strategies, universal and highly transferable perturbations, and recent techniques attacking multimodal audio-text systems in both digital and physical channels.
1. Foundational Principles and Taxonomy
The classic targeted audio adversarial attack is characterized by (1) an explicit optimization objective aligned with a chosen output (e.g., CTC loss for a phrase in speech-to-text), (2) a constraint enforcing imperceptibility—through norm bounds (e.g., $\ell_2$, $\ell_\infty$), psychoacoustic masking, or SNR caps—and (3) iterative signal generation via white-box gradients or black-box search. The general form is:

$$\min_{\delta}\; \mathcal{L}\big(f(x+\delta),\, t\big) + \lambda\, P(\delta) \quad \text{s.t.} \quad \delta \in \mathcal{C},$$

where $\mathcal{L}$ measures deviation from the target output $t$, $P$ is a perceptual penalty (e.g., signal energy), and $\mathcal{C}$ encodes domain constraints.
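The optimization above can be sketched numerically. The following is a minimal illustration, not any paper's method: a toy linear-softmax "classifier" stands in for a real audio model, and gradient descent minimizes a targeted cross-entropy plus an energy penalty under an $\ell_\infty$ clip (all names and hyperparameters here are hypothetical):

```python
import numpy as np

def toy_model(x, W):
    """Toy stand-in for an audio model: linear projection + softmax over outputs."""
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_attack(x, W, target, steps=200, lr=0.1, lam=0.01, eps=1.0):
    """Descend on L(f(x+delta), t) + lam*||delta||^2 inside an l_inf budget."""
    delta = np.zeros_like(x)
    onehot = np.eye(W.shape[0])[target]
    for _ in range(steps):
        p = toy_model(x + delta, W)
        # cross-entropy gradient wrt input, plus energy-penalty gradient
        grad = W.T @ (p - onehot) + 2 * lam * delta
        delta = np.clip(delta - lr * grad, -eps, eps)  # project to the budget
    return delta

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))   # 3 candidate "outputs", 16 waveform samples (toy scale)
x = rng.normal(size=16)        # benign "audio"
delta = targeted_attack(x, W, target=2)
```

On a real ASR system the loss would be CTC over the target transcription and the gradient would come from automatic differentiation, but the projected-descent structure is the same.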
Targets include:
- Automatic Speech Recognition (ASR): Force transcription of an input utterance to any attacker-chosen phrase (Carlini et al., 2018)
- Speaker Recognition: Force the embedding of an input to match a particular speaker’s identity (Wang et al., 2020, Duan et al., 2023)
- Sound Event Detection (SED): Switch detection boundaries for specified events at precise times (Su et al., 2 Oct 2025)
- Speech Translation (ST): Manipulate the cross-lingual output for targeted semantic leakage (Liu et al., 2 Mar 2025)
- Multimodal Audio-LLMs (ALLMs): Force specific outputs or jailbreak model safety (Ziv et al., 29 Dec 2025, Sadasivan et al., 7 Jul 2025, Kim et al., 5 Aug 2025)
A natural taxonomy distinguishes:
- White-box vs Black-box
- Instance-specific vs Universal
- Transfer-based vs Query-based
- Digital vs Over-the-air deployment
- Waveform vs Latent-space attacks
2. Algorithmic Methodologies
White-box attacks exploit full model access for gradient-based optimization. Carlini & Wagner introduce iterative CTC-based loss minimization for ASR (Carlini et al., 2018, Alparslan et al., 2020), yielding 100% targeted success at roughly 30 dB SNR while preserving high waveform similarity. The generalized update is:

$$\delta \leftarrow \mathrm{clip}_{\epsilon}\Big(\delta - \alpha\, \nabla_{\delta}\big[\mathcal{L}(f(x+\delta),\, t) + \lambda\, P(\delta)\big]\Big)$$
Stagewise improvements, such as local segment perturbation (FAAG (Miao et al., 2022)) or psychoacoustic masking (masking-threshold filtering (Wang et al., 2020)), tailor the attack to minimize both objective and subjective distortion.
Black-box approaches combine global search (e.g., genetic algorithms (Taori et al., 2018)) and local search (e.g., finite-difference gradient estimation) to optimize a surrogate loss without model internals. These achieve moderate targeted success rates (35–40%), with strong similarity metrics (95%) and precise query management.
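A schematic of the genetic-algorithm component of such black-box pipelines (the cited work also adds finite-difference gradient estimation; this toy version only queries a stand-in model's output scores, never its internals — the model and every parameter here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W_hidden = rng.normal(size=(3, 16))   # victim weights: never accessed directly
x = rng.normal(size=16)

def query(audio):
    """Black-box oracle: returns output scores only, no gradients."""
    z = W_hidden @ audio
    e = np.exp(z - z.max())
    return e / e.sum()

def genetic_attack(x, target, pop_size=30, iters=150, eps=0.5, mut_std=0.05):
    """Evolve a population of perturbations toward high target-class score."""
    pop = rng.normal(scale=0.1, size=(pop_size, x.size)).clip(-eps, eps)
    for _ in range(iters):
        fitness = np.array([query(x + d)[target] for d in pop])
        order = np.argsort(-fitness)
        elite = pop[order[: pop_size // 2]].copy()        # selection
        pa = elite[rng.integers(len(elite), size=pop_size)]
        pb = elite[rng.integers(len(elite), size=pop_size)]
        mask = rng.random((pop_size, x.size)) < 0.5
        pop = np.where(mask, pa, pb)                      # uniform crossover
        pop += rng.normal(scale=mut_std, size=pop.shape)  # mutation
        pop = pop.clip(-eps, eps)
        pop[0] = elite[0]                                 # elitism: keep best intact
    scores = [query(x + d)[target] for d in pop]
    return pop[int(np.argmax(scores))]

delta = genetic_attack(x, target=1)
```

Query budgeting in practice amounts to bounding `pop_size * iters`, which is why published black-box attacks report success jointly with query counts.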
Universal attacks seek a single perturbation robust across inputs. Abdoli et al. employ an iterative penalty method for batchwise hinge loss minimization, producing targeted UAPs with 85% ASR on classification networks, using a norm-based sound-pressure-level (SPL) penalty for imperceptibility (Abdoli et al., 2019). Recent latent-space UAPs inject a universal perturbation at the encoder, effecting attacker-chosen output in downstream ALLMs without needing decoder access (Ziv et al., 29 Dec 2025).
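The batchwise idea behind universal perturbations can be illustrated as follows — a toy classifier and an averaged cross-entropy gradient in place of the paper's hinge-loss penalty method; every detail here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 16))
X = rng.normal(size=(20, 16))   # a batch of "clips" from different sources

def probs(a):
    z = W @ a
    e = np.exp(z - z.max())
    return e / e.sum()

def universal_perturbation(X, target, steps=300, lr=0.05, eps=1.0):
    """One delta optimized over the whole batch, then reused on every clip."""
    delta = np.zeros(X.shape[1])
    onehot = np.eye(3)[target]
    for _ in range(steps):
        # batch-averaged targeted cross-entropy gradient
        g = sum(W.T @ (probs(x + delta) - onehot) for x in X)
        delta = np.clip(delta - lr * g / len(X), -eps, eps)
    return delta

delta = universal_perturbation(X, target=0)
success = float(np.mean([probs(x + delta).argmax() == 0 for x in X]))
```

The key property — input-agnostic reuse — is what makes UAPs attractive for latent-space injection, where the same perturbation is added regardless of the incoming audio.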
Transfer-based attacks craft adversarial examples on a surrogate model aiming for transferability. Notably, self-supervised learning (SSL) pretraining dramatically boosts targeted transfer rates for ASR (up to 80% at 30 dB SNR), as demonstrated by (Olivier et al., 2022). Approach refinements include audio score-matching (TransAudio (Gege et al., 2023)) for improved generalization and precise contextual control (e.g., word-level deletion/insertion/substitution).
Recent advanced schemes embed targeted payloads in benign carriers, perform reward-guided exploration (RL-PGD (Kim et al., 5 Aug 2025)), and integrate input augmentations for over-the-air robustness (Sadasivan et al., 7 Jul 2025, Liu et al., 2 Mar 2025).
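Augmentation-in-the-loop optimization (expectation over transformations) can be sketched like this, with a random circular delay plus additive noise standing in for a real acoustic channel; the toy model and all parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(3, 32))
x = rng.normal(size=32)
TARGET = 1

def probs(a):
    z = W @ a
    e = np.exp(z - z.max())
    return e / e.sum()

def grad(a):
    """Gradient of the targeted cross-entropy at input a (toy model)."""
    return W.T @ (probs(a) - np.eye(3)[TARGET])

def channel(a, shift, noise):
    """Simulated over-the-air path: circular delay plus additive noise."""
    return np.roll(a, shift) + noise

def eot_attack(x, steps=300, lr=0.1, eps=0.8, samples=8):
    """Average gradients over random channel draws so delta survives them all."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = np.zeros_like(delta)
        for _ in range(samples):
            shift = int(rng.integers(0, 4))
            noise = rng.normal(scale=0.05, size=x.size)
            # chain rule through the circular shift: un-rotate the gradient
            g += np.roll(grad(channel(x + delta, shift, noise)), -shift)
        delta = np.clip(delta - lr * g / samples, -eps, eps)
    return delta

delta = eot_attack(x)
hits = [probs(channel(x + delta, int(rng.integers(0, 4)),
                      rng.normal(scale=0.05, size=x.size))).argmax() == TARGET
        for _ in range(50)]
```

Real over-the-air attacks replace these two transforms with richer ones (room impulse responses, SpecAugment, bandpass filtering), but the averaging structure is the same.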
3. Target Domains and Attack Contexts
Targeted audio adversarial attacks span diverse application domains:
Speech-to-Text (ASR)
White-box CTC-based attacks can embed any transcription, with even non-speech “music-to-speech” transformations validated (Carlini et al., 2018, Alparslan et al., 2020). Black-box methods blend evolutionary search with gradient estimation to achieve 89% transcription similarity (Taori et al., 2018). Fast localized attacks (FAAG) segment and perturb only the initial frames for speed and yield high success (>85%) (Miao et al., 2022).
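The locality idea behind FAAG-style attacks — perturbing only an initial segment of the clip — can be sketched by masking the gradient outside that segment (toy model; this is not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(8)
W = rng.normal(size=(3, 32))
x = rng.normal(size=32)
mask = np.zeros(32)
mask[:8] = 1.0   # only the first quarter of the "clip" may be perturbed

def probs(a):
    z = W @ a
    e = np.exp(z - z.max())
    return e / e.sum()

def localized_attack(x, target, steps=300, lr=0.1, eps=1.0):
    delta = np.zeros_like(x)
    onehot = np.eye(3)[target]
    for _ in range(steps):
        g = W.T @ (probs(x + delta) - onehot)
        # masking zeroes the update outside the chosen segment
        delta = np.clip(delta - lr * g * mask, -eps, eps)
    return delta

delta = localized_attack(x, target=0)
```

Restricting the perturbation to a short prefix shrinks the optimization problem, which is where the reported speedups come from.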
Speaker Recognition
Psychoacoustic masking is leveraged to generate inaudible, highly successful targeted attacks (up to 98.5% success across gender splits), even on irrelevant carriers such as music (Wang et al., 2020). Fully black-box settings with only a few seconds of reference speech can bring about 48–81% real-world attack success against deployed devices using parrot-training pipelines (Duan et al., 2023).
Sound Event Detection
M2A attacks on polyphonic SED systems maximize targeted event editing while preserving all other detection regions using a dual-loss (adversarial plus preservation) scheme. They achieve over 99% editing precision at high SNR (Su et al., 2 Oct 2025).
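The dual-loss structure — an adversarial term on the edited event plus a preservation term holding all other detections fixed — can be illustrated on a toy multi-label detector (everything here is a hypothetical stand-in for a real SED system):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(6, 24))   # 6 "event detectors" scoring one clip
x = rng.normal(size=24)

def detect(a):
    """Per-event activation probabilities (independent sigmoids)."""
    return 1.0 / (1.0 + np.exp(-(W @ a)))

def edit_attack(x, target_event, steps=400, lr=0.05, lam=5.0, eps=0.5):
    """Activate one event while penalizing drift in every other detector output."""
    p0 = detect(x)
    delta = np.zeros_like(x)
    for _ in range(steps):
        p = detect(x + delta)
        sig_grad = p * (1 - p)
        # adversarial term: BCE pushing the target event toward "active"
        g_adv = (p[target_event] - 1.0) * W[target_event]
        # preservation term: squared drift of all non-target outputs
        resid = p - p0
        resid[target_event] = 0.0
        g_pres = 2.0 * (resid * sig_grad) @ W
        delta = np.clip(delta - lr * (g_adv + lam * g_pres), -eps, eps)
    return delta

delta = edit_attack(x, 2)
```

The weight `lam` trades editing strength against collateral change, which is what "editing precision" metrics quantify.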
Multimodal and Generative Models
Jailbreaking ALLMs requires defeating discretizing encoders and behavioral stochasticity. Modern attacks (RL-PGD (Kim et al., 5 Aug 2025), latent UAPs (Ziv et al., 29 Dec 2025)) bypass safety interceptors and force arbitrary, even harmful, completions in models like Qwen2-Audio, with up to 100% digital ASR and >86% success under human and automatic safety audits (Sadasivan et al., 7 Jul 2025, Kim et al., 5 Aug 2025).
Speech Translation
Targeted attacks on end-to-end ST frameworks induce specific, cross-lingual semantic leakage. Methods include both signal perturbation and adversarial music, the latter enabling stealth injection imperceptible to listeners yet highly effective over-the-air (Liu et al., 2 Mar 2025).
4. Perceptual Stealth, Robustness, and Evaluation
Imperceptibility is maintained through diverse mechanisms:
- $\ell_\infty$-norm bounds on waveform perturbations, commonly with $\epsilon$ on the order of $0.02$ or smaller, or SNR constraints of at least 30 dB (Carlini et al., 2018, Olivier et al., 2022).
- Psychoacoustic masking penalties that enforce perturbations below the human masking threshold across frequency bands (Wang et al., 2020).
- Score-matching losses and human-listener ABX or MOS testing (Gege et al., 2023, Wang et al., 2020).
- Quantitative metrics: audio correlation, STOI, PESQ, SNR, cross-correlation, edit distance, and model-specific targeted success rates.
- Over-the-air robustness is addressed via signal augmentations (e.g., temporal translation, additive noise, SpecAugment (Sadasivan et al., 7 Jul 2025)), bandpass filtering (Liu et al., 2 Mar 2025), and randomized room response during generation.
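Two of the quantitative metrics above — perturbation SNR and transcription edit distance — are straightforward to compute; a self-contained sketch (the signals and strings are made up for illustration):

```python
import numpy as np

def snr_db(clean, perturbed):
    """SNR of the adversarial perturbation relative to the clean signal, in dB."""
    noise = perturbed - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def edit_distance(a, b):
    """Levenshtein distance between two transcriptions (row-rolling DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

rng = np.random.default_rng(5)
clean = rng.normal(size=16000)                     # 1 s of "audio" at 16 kHz
adv = clean + rng.normal(scale=0.01, size=16000)   # faint perturbation (~40 dB SNR)
```

Perceptual metrics such as STOI and PESQ require dedicated implementations of the underlying psychoacoustic models and are not reproduced here.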
Table: Selected Attack Success and Stealth Metrics
| Domain | Attack / Paper | Digital Success | OTA Success | SNR / Metric |
|---|---|---|---|---|
| ASR | C&W (Carlini et al., 2018) | 100% | — | ~30 dB SNR |
| Speaker Recog. | Psychoacoustic (Wang et al., 2020) | 98.5% | — | ABX pref. 68% |
| ALLM | PGD+Aug (Sadasivan et al., 7 Jul 2025) | 100% | 100% | — |
| SED | M2A (Su et al., 2 Oct 2025) | 99.1% EP | — | high SNR |
| ST | Music+Bandpass (Liu et al., 2 Mar 2025) | 90–100% | 50–67% | PESQ 4.0–4.5 |
Editing Precision (EP), attack success rate (ASR), and MOS are tailored for task and evaluation context. Use of augmentation and cross-lingual/cross-device validation provides practical security context.
5. Transferability and Universality
Transferability is central for real-world black-box attack risk:
- Self-supervised pretraining in ASR models leads to high targeted transferability: multi-proxy crafted perturbations reach up to 80% target-match rates on black-box models (Olivier et al., 2022).
- Universal perturbations, both iterative and penalty-based, achieve 85–97% targeted ASR on CNN classifiers (Abdoli et al., 2019) and enable attacks decoupled from input audio (Ziv et al., 29 Dec 2025).
- Environment-conditioned universal attacks against speaker ID achieve 47% success in over-the-air tests using environmental carrier signals (Duan et al., 2023).
- In SED and ST, the attacker's ability to transfer single perturbations across models or languages exposes cross-architecture and cross-lingual vulnerabilities (Su et al., 2 Oct 2025, Liu et al., 2 Mar 2025).
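Transfer via surrogate ensembles can be sketched by averaging targeted gradients over several similar surrogate models and then applying the result to a held-out "victim" — toy linear models standing in for networks with shared (e.g., SSL-pretrained) features; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
C, D = 3, 16
base = rng.normal(size=(C, D))
# surrogates and victim differ only by small weight noise (shared features)
surrogates = [base + 0.1 * rng.normal(size=(C, D)) for _ in range(4)]
victim = base + 0.1 * rng.normal(size=(C, D))
x = rng.normal(size=D)

def probs(Wm, a):
    z = Wm @ a
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_attack(x, models, target, steps=300, lr=0.1, eps=0.8):
    """Average targeted gradients over all surrogates; never touch the victim."""
    delta = np.zeros_like(x)
    onehot = np.eye(C)[target]
    for _ in range(steps):
        g = sum(Wm.T @ (probs(Wm, x + delta) - onehot) for Wm in models)
        delta = np.clip(delta - lr * g / len(models), -eps, eps)
    return delta

delta = ensemble_attack(x, surrogates, target=2)
```

The closer the surrogates' learned features are to the victim's — the situation SSL pretraining creates — the better the crafted perturbation carries over.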
Limiting adversarial transfer defensively remains an open challenge: standard denoising or quantization degrades both attack success and legitimate accuracy, yet is insufficient against robust, stealthy attacks that leverage contextual or psychoacoustic priors.
6. Security Implications and Defensive Strategies
Demonstrations of targeted audio adversarial attacks in highly deployed modalities—voice assistants, speaker authentication, content moderation, and multilingual LLMs—underline a critical attack surface with broad practical risk. Key implications:
- End-to-end and self-supervised systems are highly exposed, as similarity in learned features enables high transferability absent weight or API access (Olivier et al., 2022, Gege et al., 2023).
- ALLMs and ST systems are acutely vulnerable, with latent attacks bypassing post-encoder safety, and direct optimization over the decoder output achieving 100% digital attack rates (Ziv et al., 29 Dec 2025, Kim et al., 5 Aug 2025).
- “Adversarial music” approaches weaponize benign-appearing sound, enabling stealthy over-the-air or background attacks indistinguishable to humans (Liu et al., 2 Mar 2025).
Defensive approaches studied include:
- Adversarial training with inaudible or transferable perturbations (Wang et al., 2020, Olivier et al., 2022).
- Input transformations: resampling, random band-limiting, noise gating, and neural codec compression (e.g., EnCodec, which at low bitrates can entirely block over-the-air attacks) (Sadasivan et al., 7 Jul 2025, Liu et al., 2 Mar 2025).
- Psychoacoustic anomaly detection: flagging spectral artifacts above masking thresholds (Wang et al., 2020).
- Latent-space anomaly detectors and robust feature extraction at the encoder (Ziv et al., 29 Dec 2025).
- Benign signal pre-padding to defeat localized attacks (FAAG (Miao et al., 2022)).
- Preservation-aware optimization or ensemble modeling to defeat precise event editing attacks (Su et al., 2 Oct 2025).
No single strategy is fully effective—hybrid approaches combining robust training, randomized input processing, and online anomaly detection are actively explored.
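Two of the defensive primitives above can be sketched compactly: a crude down/up-sampling input transformation (feature "squeezing"), and an instability score that flags inputs whose predictions flip under small random transforms — adversarial examples are often far more sensitive to such transforms than benign audio. The code and all parameters are hypothetical toys:

```python
import numpy as np

rng = np.random.default_rng(7)

def squeeze(audio, factor=2):
    """Input transformation: keep every factor-th sample, then hold it."""
    coarse = audio[::factor]
    return np.repeat(coarse, factor)[: audio.size]

def instability_score(audio, classify, trials=10, noise=0.02):
    """Fraction of small random perturbations that flip the prediction."""
    base = classify(audio)
    flips = sum(classify(audio + rng.normal(scale=noise, size=audio.size)) != base
                for _ in range(trials))
    return flips / trials

# toy "classifier" for demonstration: sign of the summed signal
classify = lambda a: int(a.sum() > 0)
benign = np.full(16, 0.5)
score = instability_score(benign, classify)
```

A deployment would threshold the instability score (or combine it with spectral anomaly checks) to reject suspect inputs before they reach the recognizer.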
7. Research Directions and Open Problems
Open questions and emergent challenges include:
- Formally certified robustness for audio models, as interval bound propagation and randomized smoothing remain immature for long sequential audio inputs (Olivier et al., 2022).
- Universal transferability across architectures and over-the-air execution, particularly as model variety and physical environment complexity increase.
- Scaling adversarial training, especially for SSL-pretrained models spanning vast acoustic domains.
- Semantic-level defenses (e.g., cross-modal agreement, multi-stage input validation) in multi-modal compositional pipelines (Kim et al., 5 Aug 2025, Ziv et al., 29 Dec 2025).
- Watermarking/adversarial detection embedded in audio encoders and decoders.
- Practical trade-offs between imperceptibility, attack efficacy, and computational resources—fast (FAAG (Miao et al., 2022)), perceptually optimal (psychoacoustic (Wang et al., 2020)), or attack-class adaptive.
Continued co-evolution of attack and defense strategies is anticipated as speech, audio, and multimodal models further permeate security- and safety-critical domains.