Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Published 16 Apr 2026 in cs.CR, cs.AI, and cs.SD | (2604.14604v1)

Abstract: Modern Large audio-LLMs (LALMs) power intelligent voice interactions by tightly integrating audio and text. This integration, however, expands the attack surface beyond text and introduces vulnerabilities in the continuous, high-dimensional audio channel. While prior work studied audio jailbreaks, the security risks of malicious audio injection and downstream behavior manipulation remain underexamined. In this work, we reveal a previously overlooked threat, auditory prompt injection, under realistic constraints of audio data-only access and strong perceptual stealth. To systematically analyze this threat, we propose \textit{AudioHijack}, a general framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. \textit{AudioHijack} employs sampling-based gradient estimation for end-to-end optimization across diverse models, bypassing non-differentiable audio tokenization. Through attention supervision and multi-context training, it steers model attention toward adversarial audio and generalizes to unseen user contexts. We also design a convolutional blending method that modulates perturbations into natural reverberation, making them highly imperceptible to users. Extensive experiments on 13 state-of-the-art LALMs show consistent hijacking across 6 misbehavior categories, achieving average success rates of 79\%-96\% on unseen user contexts with high acoustic fidelity. Real-world studies demonstrate that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions on behalf of users. These findings expose critical vulnerabilities in LALMs and highlight the urgent need for dedicated defense.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper demonstrates that adversarial audio injections can hijack LALMs with success rates up to 96% across diverse misbehavior categories.
It introduces AudioHijack, combining gradient estimation, multi-context training, and convolutional perturbation blending to craft imperceptible audio prompts.
The study reveals that standard defenses often fail, highlighting the need for robust behavior-level and attention-based auditing in multimodal systems.

Systematic Analysis of Imperceptible Auditory Prompt Injection in Large Audio-LLMs

Introduction

This work rigorously characterizes and quantifies a new threat vector—indirect, context-agnostic auditory prompt injection—for Large Audio-LLMs (LALMs). LALMs, which power contemporary voice assistants and multimodal conversational agents, inherently expand the attack surface into the audio domain by tightly integrating high-dimensional, continuous audio data with textual interfaces. Prior studies have focused on direct audio-based jailbreaks, but the security implications of malicious audio injection that manipulates LALM downstream behavior covertly, with minimal perceptual impact and without full control over the user input, have remained underexplored.

This paper introduces AudioHijack, a generic attack framework that enables highly stealthy and context-robust adversarial audio construction, overcoming key obstacles posed by LALM architecture heterogeneity, context opacity, and perceptual imperceptibility requirements. Through large-scale empirical evaluation on open-source and commercial LALMs, the work demonstrates that covert adversarial audio can achieve high success rates (up to 96%) across a spectrum of misbehavior categories, including unauthorized tool invocation, phishing, disinformation, and denial-of-service behaviors. The practical and theoretical implications underscore fundamental vulnerabilities in the current multimodal alignment paradigm for LALMs and expose challenges for effective detection and mitigation strategies.

Auditory Prompt Injection: Threat Model and Problem Formulation

The threat model considers a third-party adversary with audio data-only access, unable to observe or control user instructions (text or voice). The adversary seeks to induce LALMs to exhibit target misbehaviors irrespective of the user's context, with the injected audio being imperceptible to humans. This reflects real-world scenarios such as media transcription, multi-participant conferencing, or content ingest from external sources.

Figure 1: Schematic threat model for context-agnostic auditory prompt injection, illustrating adversary access and attack surfaces.

Formally, the attack objective is to generate an adversarial audio $\hat{x}_d$ from benign carrier $x_d$ such that, for any user context $x_c$ , the LALM output exhibits a specific target behavior $B$ , while $\hat{x}_d$ remains perceptually similar to $x_d$ . The challenge is compounded by heterogeneous and often non-differentiable audio-text integration schemes (discrete tokens, continuous embeddings, hybrid), unknown user instructions, and stringent stealth requirements.

AudioHijack: General Attack Framework

Sampling-based Gradient Estimation

Traditional adversarial example generation is inhibited in LALMs by non-differentiable tokenization and quantization layers. AudioHijack introduces a differentiable surrogate using Gumbel-Softmax sampling, allowing soft, backproppable approximation of discrete token selection and embedding lookup. This enables end-to-end optimization for both discrete and hybrid LALM architectures, bypassing previous gradient obstructions.

Attention-Guided Context Generalization

Context opacity—where arbitrary, unseen user input can dilute or override adversarial effects—is overcome by multi-context training combined with explicit attention supervision. By optimizing adversarial audio over a diverse auxiliary dataset of instructions and directly shaping model attention weights, the attack is robustly transferred to previously unseen conversational settings, increasing efficacy against contextually sensitive LALMs.

Figure 2: AudioHijack framework pipeline integrating gradient estimation, multi-context generalization, and convolutional blending.

Convolutional Perturbation Blending

To reconcile the tension between stealth and attack power, the perturbation is applied via learnable convolutional kernels inspired by real-world reverberation, rather than simple additive noise. This spectral and temporal blending approach produces adversarial examples that are spectrally indistinguishable from natural reverberated audio, as validated by objective metrics (SNR, MCD, PESQ, STOI) and visual spectral analysis.

Empirical Evaluation

Attack Generality and Success

AudioHijack is evaluated against 13 state-of-the-art LALMs covering discrete, continuous, and hybrid schemes, including commercial-scale production agents. Across six misbehavior categories—auditory blindness, refusal, disinformation, phishing, persona control, and tool misuse—the framework achieves average success rates of 0.79–0.96 even under strict imperceptibility constraints, consistently outperforming additive or poorly blended baselines.

Figure 3: Model-wise BMSR and PISR for non-tool-use misbehaviors. High and uniform vulnerability is observed across LALMs.

Figure 4: Robustness to unseen user contexts—attack success shows minimal degradation with increasing instruction length or modality shifts.

Practical studies using commercial APIs (Mistral AI Voxtral, Microsoft Azure Phi-4-Multimodal) confirm transferability: locally crafted adversarial audio successfully triggers unauthorized actions including search, file download, and information exfiltration on deployed agents.

Figure 5: Example of a real-world successful attack; adversarial audio triggers unintended assistant behavior in commercial LALM.

Stealth Analysis

Convolutional blending achieves SNR above 28 dB, MCD < 4.2, and PESQ up to 3.16 for speech carriers, outperforming $L_\infty$ and $L_2$ additive attacks which produce conspicuous artifacts. Stealth is further enhanced when music carriers are used, owing to their inherent reverberation.

Figure 6: Attack stealth is robust to auxiliary context data size; strong performance is maintained with as few as 20 auxiliary instructions.

Figure 7: Impact of carrier audio duration on attack success—minimal length required for high BMSR is around 10 seconds.

Ablation and Sensitivity

The efficacy is robust to reductions in auxiliary instruction sets (down to 20 samples) and to carrier audio length (effective from 7.5 seconds). Sensitivity to sampling temperature is observed, with degraded performance only at unusually high generation randomness.

Figure 8: Sampling temperature influence; attack degrades gracefully as decoding becomes stochastic.

Defense and Detection

Defense analysis reveals that prompt-level (in-context), response-level (self-reflection), and logit-level (logits divergence, WaveGuard) detection mechanisms are systematically evaded by AudioHijack. Injection works by hijacking attention to the adversarial audio, resulting in high-confidence and context-consistent outputs that escape surface-level anomaly detection.

Figure 9: Most standard prompt injection defenses fail to significantly reduce the behavior match success rate.

Fine-grained attention-based detection—specifically, inspection of the deviation in attention distribution between audio data and user context—offers stronger separation (precision 0.98, recall 0.93 in non-adaptive scenarios). However, adaptive attackers can reduce detectability by minimizing attention manipulation at the cost of reduced attack power, exposing an effectiveness-detectability trade-off.

Figure 10: ROC for logits-based divergence defense is suboptimal compared to attention-based detection.

Figure 11: PCA visualization shows clean separation between adversarial and benign samples via attention deviation features.

Implications and Future Directions

This work exposes fundamental security limitations in the prevailing design of audio-language integration. The demonstrated vulnerabilities are architecture-agnostic, highly transferable, and achievable with low-bandwidth, stealthy perturbations. The results imply that mere adversarial training or surface-level monitoring is insufficient; deep behavior-level and attention-level auditing are required.

The threat highlights the urgent necessity for multimodal defense research—robust training against covert prompt injection, real-time internal state monitoring, and potentially architectural redesigns that deconfound instruction/data boundaries in audio streams.

Future directions include development of black-box and gradient-free attack variants, benchmarking broader misbehavior spectra, and the study of hardware/sensor-level mitigation (e.g., perturbation filtering, adversarial provenance tracing).

Conclusion

AudioHijack sets a new standard for adversarial prompt injection research in LALMs, exposing the breadth and stealth with which model behavior can be covertly subverted. Its findings challenge both the security-by-design and defense paradigms dominant in LALM deployments and call for the adoption of more sophisticated, behavior-aware detection and mitigation pipelines. The research clearly demonstrates that, without behavioral and attention-based auditing, the current generation of voice-enabled LLMs is systematically vulnerable to imperceptible, context-agnostic hijacking attacks (2604.14604).