Papers
Topics
Authors
Recent
Search
2000 character limit reached

Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models

Published 31 Oct 2024 in cs.CL, cs.MM, cs.SD, and eess.AS | (2410.23861v1)

Abstract: Large Multimodal Models (LMMs) have demonstrated the ability to interact with humans under real-world conditions by combining LLMs and modality encoders to align multimodal information (visual and auditory) with text. However, such models raise new safety challenges of whether models that are safety-aligned on text also exhibit consistent safeguards for multimodal inputs. Despite recent safety-alignment research on vision LMMs, the safety of audio LMMs remains under-explored. In this work, we comprehensively red team the safety of five advanced audio LMMs under three settings: (i) harmful questions in both audio and text formats, (ii) harmful questions in text format accompanied by distracting non-speech audio, and (iii) speech-specific jailbreaks. Our results under these settings demonstrate that open-source audio LMMs suffer an average attack success rate of 69.14% on harmful audio questions, and exhibit safety vulnerabilities when distracted with non-speech audio noise. Our speech-specific jailbreaks on Gemini-1.5-Pro achieve an attack success rate of 70.67% on the harmful query benchmark. We provide insights on what could cause these reported safety-misalignments. Warning: this paper contains offensive examples.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (39)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Qwen technical report. CoRR, abs/2309.16609.
  3. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
  4. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  5. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  6. Qwen2-audio technical report. CoRR, abs/2407.10759.
  7. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. CoRR, abs/2311.07919.
  8. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. CoRR, abs/2209.07858.
  9. Figstep: Jailbreaking large vision-language models via typographic visual prompts. CoRR, abs/2311.05608.
  10. Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. CoRR, abs/2406.07594.
  11. Walking a tightrope - evaluating large language models in high-risk domains. CoRR, abs/2311.14966.
  12. Llama guard: Llm-based input-output safeguard for human-ai conversations. CoRR, abs/2312.06674.
  13. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 3923–3954. Association for Computational Linguistics.
  14. Red teaming visual language models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 3326–3342. Association for Computational Linguistics.
  15. Deepinception: Hypnotize large language model to be jailbreaker. CoRR, abs/2311.03191.
  16. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
  17. Can llms follow simple rules? CoRR, abs/2311.04235.
  18. Speechguard: Exploring the adversarial robustness of multimodal large language models. CoRR, abs/2405.08317.
  19. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  20. Safetyprompts: a systematic review of open datasets for evaluating and improving large language model safety. CoRR, abs/2404.05399.
  21. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
  22. SALMONN: towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
  23. ALERT: A comprehensive benchmark for assessing large language models’ safety through red teaming. CoRR, abs/2404.08676.
  24. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  25. Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).
  26. White-box multimodal jailbreaks against large vision-language models. CoRR, abs/2405.17894.
  27. Do-not-answer: A dataset for evaluating safeguards in llms. CoRR, abs/2308.13387.
  28. SELF-GUARD: empower the LLM to safeguard itself. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 1648–1668. Association for Computational Linguistics.
  29. Cvalues: Measuring the values of chinese large language models from safety to responsibility. CoRR, abs/2307.09705.
  30. Cognitive overload: Jailbreaking large language models with overloaded logical thinking. In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 3526–3548. Association for Computational Linguistics.
  31. Yue Xu and Wenjie Wang. 2024. Linkprompt: Natural and universal adversarial attacks on prompt-based language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 6473–6486. Association for Computational Linguistics.
  32. Jigsaw puzzles: Splitting harmful questions to jailbreak large language models. arXiv preprint arXiv:2410.11459.
  33. Investigating Pre-trained Audio Encoders in the Low-Resource Condition. In Proc. INTERSPEECH 2023, pages 1498–1502.
  34. Jailbreak vision language models via bi-modal adversarial prompt. CoRR, abs/2406.04031.
  35. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 14322–14350. Association for Computational Linguistics.
  36. Spa-vl: A comprehensive safety preference alignment dataset for vision language model. arXiv preprint arXiv:2406.12030.
  37. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045.
  38. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
  39. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
Citations (1)

Summary

  • The paper identifies critical safety vulnerabilities in audio multimodal models using a red teaming framework with an average attack success rate of 69.14%.
  • It employs systematic evaluations across five models and introduces an audio segmentation jailbreak that compromised Gemini-1.5-Pro, achieving a 70.67% success rate.
  • The findings emphasize the need for robust, adaptive multimodal frameworks to mitigate harmful audio inputs and non-speech distractions for improved security.

Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models

Introduction

The paper "Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models" (2410.23861) presents a comprehensive exploration of the safety vulnerabilities inherent in Large Multimodal Models (LMMs), particularly those integrating auditory inputs. While LMMs, which amalgamate LLMs with modality encoders for visual and auditory understanding, have opened new avenues for real-world applications, they concurrently introduce significant safety challenges. This study emphasizes the under-explored domain of audio LMMs, compared to the more addressed vision LMMs, in terms of safety alignment.

Methodology

The authors undertake a red teaming approach to evaluate the safety of five advanced audio LMMs: Qwen-Audio, Qwen2-Audio, SALMONN-7B, SALMONN-13B, and Gemini-1.5-Pro. The exploration is categorized into three distinct scenarios:

  1. Harmful Questions in Audio and Text Formats: This examines the models' responses to potentially dangerous inquiries presented both as audio and text.
  2. Non-Speech Audio Distractions: This assesses the impact of irrelevant, non-speech audio added to harmful text queries.
  3. Speech-Specific Jailbreaks: This investigates strategies for bypassing predefined safeguarding using audio-based manipulations.

Results and Discussion

The red teaming experiments reveal significant findings. The open-source audio LMMs demonstrate an average attack success rate of 69.14% when faced with harmful audio questions, unveiling serious safety lapses. Figure 1

Figure 1: The percentage of harmful/safe responses beginning with specific words (\%).

Interestingly, Gemini-1.5-Pro, with its sophisticated safety filters, exhibits a robust defense against harmful inputs, achieving a near 0% attack success rate under plain harmful conditions. However, the study highlights that even these models with advanced safeguards can be compromised when targeted with cleverly constructed speech-specific attacks. Notably, a newly proposed jailbreak strategy that decomposes harmful terms into audio segments substantially circumvents Gemini's defenses, achieving a 70.67% success rate. Figure 2

Figure 2: The ASR-q for non-speech audio input injections across different audio lengths (2-14 seconds).

Furthermore, when examining the effects of non-speech audio distractions, the authors find that introducing arbitrary noises significantly destabilizes the safety measures in place, altering the models' representation spaces and increasing their vulnerability to adversarial attacks. Such noise introduces up to a 32.58% variation in the attack success rate, clearly indicating that existing defenses are insufficient in handling audio modality's complexity.

Visual Representation Analysis

The paper also explores the representation space dynamics of these models. The visualizations using t-SNE illustrate how harmful and benign inputs are mapped within the network's representation space, offering insights into the models' decision-making processes. Figure 3

Figure 3: The shape of representation space on SALMONN-7B under various length of input silence audio. "0s" denotes no audio input.

The distortion introduced to these spaces by non-speech audio, such as silence of varying lengths, demonstrates a clear lack of robustness in the models' auditory processing capabilities.

Future Directions

The implications of this research are twofold. Practically, it underscores the necessity for developing more resilient multimodal frameworks that can anticipate and mitigate jailbreak attempts via both speech and noise. Theoretically, it invites further investigation into integrating multimodal safety mechanisms that harmonize audio, text, and visual inputs without compromising their collective security alignment.

Conclusion

This paper shines a light on the critical yet under-researched domain of audio LMM safety. It demonstrates both the potential and the pitfalls of these advanced models. As AI systems increasingly integrate multimodal capabilities, ensuring their safe operation remains paramount. The study advocates for continued research into adaptive safeguarding strategies, particularly in handling the nuanced vulnerabilities that arise from auditory inputs.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.