Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Published 17 Apr 2026 in cs.CR and cs.SD | (2604.16659v1)

Abstract: Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space -- leaving open whether distinct input properties drive the vulnerability differently. Audio introduces a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes using external reference encoders alongside each model's own internal encoder, we show that benign fine-tuning elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12%. Crucially, the dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned -- determined by how each model's encoder and projector transform audio into the LLM's input space. We propose two defenses: filtering training data to maximize distance from harmful embeddings, and a textual system prompt at inference, both reducing JSR to near-zero without architectural modification. Our mechanistic analysis on two architectures reveals that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned, mirroring the behavioral asymmetries across modalities. Safety degradation from benign fine-tuning is a qualitatively distinct risk in Audio LLMs.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper demonstrates that benign fine-tuning using proximity-based selection can increase the Jailbreak Success Rate in Audio LLMs from 4.62% to as high as 87.12%.
It employs embedding-based filters across semantic, acoustic, and mixed axes to reveal architecture-dependent vulnerabilities in models like Audio Flamingo, Kimi-Audio, and Qwen2.5-Omni.
The study proposes defense strategies such as distant filtering and the use of textual system prompts to restore safety alignment following degradation.

Safety Degradation from Benign Fine-Tuning in Audio LLMs

Motivation and Context

Recent progress in audio LLMs (Audio LLMs) has enabled their deployment across diverse tasks encompassing speech question answering, dialogue, and complex audio-centric reasoning. Fine-tuning these models with user-supplied, ostensibly benign audio data has become a standard customization protocol. Prior work in text and vision domains has demonstrated that fine-tuning aligned LLMs can compromise safety, even when the training data is non-adversarial. However, these analyses are agnostic to modality-specific structural properties. The paper "Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs" (2604.16659) conducts the first comprehensive study of modality-conditioned safety erosion, showing that benign audio fine-tuning can dramatically elevate Jailbreak Success Rate (JSR) in state-of-the-art Audio LLMs, with the degradation controlled by architecture-dependent representational proximity.

Figure 1: Embedding-based filtering selects benign audio samples closest to harmful embeddings; fine-tuning on these proximate subsets causes dramatic safety alignment breakdown, elevating JSR from 4.62% to 87.12%.

Methodological Framework

Embedding-Based Proximity Filtering

The study develops a filtering protocol based on pairwise cosine distance between benign and harmful audio embeddings extracted by either model-internal or external reference encoders. Filtering axes include:

Semantic (Sentence-BERT): Captures text-level proximity via transcription and sentence embedding.
Acoustic (WavLM): Emphasizes speaker identity, prosody, and other low-level features.
Mixed (Whisper-Large-V3): Combines linguistic and acoustic content.

Benign samples closest to harmful content in the embedding space are used for fine-tuning; the effect of proximity, dataset composition, and model architecture on safety are jointly evaluated.

Figure 2: For each benign sample $b_i$ , minimum cosine distance to all harmful samples is computed; the top-k closest and bottom-k farthest are used for dose-response and defense analysis.

Architectural Variants

Three notable Audio LLMs are examined:

Audio Flamingo 3 (AF3): Unified Whisper encoder followed by an MLP projector.
Kimi-Audio-7B-Instruct: Dual-encoder with quantization bottleneck and semantic token pipeline.
Qwen2.5-Omni: Pass-through architecture using Whisper-Large-V3 features.

These variations produce distinct representational spaces, with proximity axes weighted differently depending on input transformation.

Figure 3: t-SNE projection visualizes proximity and overlap between benign and harmful embeddings across encoder types; Whisper-V3 shows substantial intermixing, WavLM provides acoustic separation.

Main Empirical Findings

Safety Breakdown via Proximate Benign Fine-Tuning

Finetuning with benign audio selected for proximity to harmful prompts causes catastrophic safety collapse. For Kimi-Audio, model-internal and semantic filtering at 25% data elevate AdvBench JSR from 4.62% to 58.08% and 87.12%, respectively; AF3 and Qwen2.5-Omni exhibit similar directional shifts, though the dominant axis is architecture-dependent. Random fine-tuning increases JSR, but filtering for proximity is significantly more damaging.

The vulnerability axis (semantic vs. acoustic) that best predicts safety degradation is explicitly conditioned on the model's encoder design. For dual-encoder architectures (Kimi-Audio), text-semantic proximity dominates; for unified designs (AF3), mixed proximity is most predictive. Audio vs. text fine-tuning reveals cross-modal asymmetry: in AF3, audio fine-tuning increases JSR while text fine-tuning decreases it; Qwen2.5-Omni exhibits the opposite, explained by representational alignment to text-trained refusal boundaries.

Figure 4: JSR after semantic proximity-filtered fine-tuning as text vs. audio; architecture determines whether safety degrades more for audio or textual training on identical content.

Mechanistic Analysis

Safety degradation manifests as selective suppression of the late-layer refusal circuit in the LLM, while the frozen audio encoder preserves representation integrity. The suppression occurs along the pathway least covered by alignment training, mirroring behavioral divergence and confirming mechanistic vulnerability.

Figure 5: Refusal signal projection onto intervention direction across LLM layers; suppression after fine-tuning aligns with elevated JSR and architectural pathway.

Defense Strategies and Generalization

Two interventions are proposed:

Distant Filtering: Training on benign audio farthest from harmful content in embedding space robustly preserves safety (JSR near-zero), especially for models with compressive projectors.
Textual System Prompt: Prepending explicit refusal instructions at inference restores alignment without any architectural modifications; JSR drops to baseline levels even after safety-compromising fine-tuning.

These approaches require only data preprocessing or prompt engineering and can be deployed across architectures.

Practical and Theoretical Implications

The findings challenge the assumption that safety alignment in Audio LLMs is robust to benign customization. Fine-tuning with benign, content-inspectable data can produce compliance with harmful prompts if encoder representations are proximate, independent of intent or surface-level semantics. This effect is not observable in text and vision LLMs due to their shared parameter pathways. The explicit decomposition into semantic and acoustic axes reveals that content moderation and standard screening cannot protect against proximity-based risk; calibration must be architecture-aware. Mechanistic evidence shows that late-layer refusal suppression, rather than encoder drift, drives behavioral misalignment.

Practically, filtering for embedding distance to harmful content and leveraging system prompts should be incorporated into fine-tuning workflows for Audio LLMs. Theoretical directions include automated detection of proximity-based data vulnerabilities, integration of audio-specific safety boundaries into alignment training, and generalization to non-speech modalities (music or environmental audio).

Future Directions

The study is limited to English speech datasets and single-turn interactions; extension to multilingual and conversational settings is needed. Robustness under adversarial audio perturbations remains to be systematically characterized. Unfreezing audio encoders, exploring chain-of-thought reasoning tasks, and evaluating broader utility preservation under safety adaptation are key next steps.

Conclusion

Benign fine-tuning in Audio LLMs presents a modality-specific, architecture-conditioned alignment risk not observed in text or vision domains. Proximity to harmful content in encoder representation space predicts safety erosion, with JSR rising from single digits to 87.12%. Defenses based on distant filtering and system prompts effectively restore alignment. As Audio LLM customization proliferates, data screening and modality-aware safety evaluation are critical for robust deployment.

Markdown Report Issue