
MLLMGuard: Safeguarding Multimodal Models

Updated 6 January 2026
  • MLLMGuard denotes a class of frameworks that integrate architectures, benchmarks, and plug-and-play systems to safeguard multimodal LLMs from adversarial inputs.
  • These systems employ techniques such as noise aggregation, prompt engineering, and chain-of-thought reasoning to filter and mitigate harmful visual, audio, and textual stimuli.
  • They pair evaluation suites with continuous red-teaming to balance safety measures against model utility across speech, image, and text modalities.

Multimodal LLM Guard (MLLMGuard) refers to the entire class of frameworks, architectures, benchmarks, and plug-and-play systems designed to systematically safeguard Multimodal LLMs (MLLMs) from adversarial or harmful inputs—including compositional jailbreak prompts, malicious visual/audio stimuli, and unsafe query-response interactions—while striving to maintain model utility and performance. MLLMGuard encompasses both integrated safety modules and external evaluation suites, spanning input/output filtering, prompt engineering, classifier heads, reasoning pipelines, and adversarial stress tests, with design choices explicitly targeting the unique cross-modal vulnerabilities of MLLMs.

1. Threat Landscape for Multimodal LLMs

MLLMs dramatically expand conventional LLM attack surfaces by simultaneously processing text, image, audio, and speech modalities, exposing new classes of vulnerabilities. Modern threat models center on compositional and cross-modal attacks:

  • Speech-Audio Composition Attacks: SACRED-Bench (Yang et al., 13 Nov 2025) demonstrated attacks where malicious speech is overlapped or mixed with benign audio. For instance, speech-overlap embeds harmful prompts beneath carrier speech, while multi-speaker dialogues mask prohibited instructions within conversational cues.
  • Visual Adversarial Attacks: MLLM-Protector (Pi et al., 2024), UniGuard (Oh et al., 2024), and SmoothGuard (Su et al., 29 Oct 2025) treated images as continuous, "foreign language" signals that evade text-centric alignment and can be adversarially perturbed to elicit unsafe outputs.
  • Prompt Injection and Jailbreaks: QGuard (Lee et al., 14 Jun 2025), GUARD (Jin et al., 2024, Jin et al., 28 Aug 2025), and Beaver-Guard-V (Ji et al., 22 Mar 2025) address directives that skirt textual safety filters, including obfuscated queries, role-play scenarios, and token-level prompt engineering.
  • Cross-modal and Relational Hazards: RapGuard (Jiang et al., 2024) highlighted dangers where the combination of images/text/speech leads to emergent hazardous interpretations not present in unimodal inputs.

The attack success rates (ASR) against state-of-the-art models—e.g., Gemini 2.5 Pro ASR 66% under SACRED-Bench compositional audio attacks (Yang et al., 13 Nov 2025), ~70% under malicious visual queries (Pi et al., 2024)—underscore the urgency for multimodal-specific defense.

2. Frameworks and Architectural Paradigms

MLLMGuard solutions exhibit diverse interfaces: model-integrated safety heads, prompt-based guards, evaluation suites, and plug-and-play utility modules.

Representative Systems

| Guard System | Modalities | Key Mechanism | Deployment Mode |
| --- | --- | --- | --- |
| SALMONN-Guard | Speech, Text, Audio | Cross-modal transformer + safety head | Model-integrated |
| QGuard | Text, Image, Audio | Zero-shot guard questions, graph filtering | Front-end |
| SmoothGuard | Image, Audio, Text | Noise smoothing + clustering aggregation | Inference plug-in |
| Llama Guard 3 Vision | Image, Text | Trainable classification heads | Model-integrated |
| UniGuard | Image, Text | PGD noise (image) + gradient token search (text) | Plug-and-play |
| Beaver-Guard-V | Image, Text | Multi-level binary/severity classifier, regeneration | Pipeline controller |
| RapGuard | Image, Text | Chain-of-thought rationale + adaptive prompts | Prompting/inference |
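The model-integrated entries above share a common shape: modality encoders feed cross-modal attention, and a pooled fused representation drives a trainable safety classification head. A minimal PyTorch sketch of that pattern, with dimensions and module layout chosen for illustration rather than taken from any published architecture:

```python
import torch
import torch.nn as nn

class CrossModalSafetyHead(nn.Module):
    """Toy model-integrated guard: text tokens attend over image/audio features,
    and a pooled fused representation is classified as safe vs. unsafe."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_labels: int = 2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, n_labels)

    def forward(self, text_feats: torch.Tensor, other_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, T_text, d_model); other_feats: (B, T_image_or_audio, d_model)
        fused, _ = self.cross_attn(query=text_feats, key=other_feats, value=other_feats)
        pooled = self.norm(fused + text_feats).mean(dim=1)  # residual + mean pooling
        return self.classifier(pooled)                       # safe/unsafe logits

# Training would minimize cross-entropy on labeled harmful/benign query-response pairs.
```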

Each framework targets different cross-modal interactions and rests on different technical assumptions: SALMONN-Guard uses cross-modal attention and alignment gating for detailed inspection of speech and audio signals (Yang et al., 13 Nov 2025); QGuard constructs a bank of guard questions and aggregates MLLM responses using PageRank-based risk metrics (Lee et al., 14 Jun 2025); SmoothGuard applies randomized smoothing to image/audio inputs and clusters output embeddings at inference (Su et al., 29 Oct 2025); UniGuard parameterizes additive guardrails for image/text via PGD and token-gradient search (Oh et al., 2024); RapGuard generates a stepwise safety rationale and adapts defensive prompting on a per-query/scene basis (Jiang et al., 2024).
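A minimal sketch of the QGuard-style front end, assuming a hypothetical ask_mllm helper that returns the model's probability of answering "yes" to a guard question; the question bank, graph construction, and decision threshold here are illustrative assumptions rather than the published configuration:

```python
import networkx as nx

# Hypothetical guard-question bank, grouped by risk category (illustrative only).
GUARD_QUESTIONS = {
    "violence": ["Does the request seek instructions for physical harm?",
                 "Does the image depict weapons used to threaten someone?"],
    "privacy":  ["Does the request ask to identify a private individual?",
                 "Does the request seek personal data such as a home address?"],
}

def ask_mllm(question: str, user_prompt: str, image) -> float:
    """Placeholder for an MLLM call: concatenate the guard question with the
    user prompt/image and return the model's probability of answering 'yes'."""
    raise NotImplementedError

def qguard_risk(user_prompt: str, image, threshold: float = 0.5) -> tuple[float, bool]:
    scores = {}
    graph = nx.Graph()
    for group, questions in GUARD_QUESTIONS.items():
        for q in questions:
            scores[q] = ask_mllm(q, user_prompt, image)
            graph.add_node(q, group=group)
        # Assumption: questions within a group are linked, with edge weights
        # given by the product of their harm scores.
        for i, qa in enumerate(questions):
            for qb in questions[i + 1:]:
                graph.add_edge(qa, qb, weight=scores[qa] * scores[qb])
    # PageRank over the weighted question graph emphasizes mutually corroborating questions.
    pr = nx.pagerank(graph, weight="weight")
    risk = sum(pr[q] * scores[q] for q in scores)
    return risk, risk > threshold
```

The resulting risk score can then gate whether the query is forwarded to the serving model or refused at the front end.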

3. Training Paradigms, Inference Strategies, and Losses

MLLMGuard systems leverage a spectrum of training and inference approaches, tuned to multimodal safety:

  • Supervised Safety Labeling: Fine-tuning with labeled harmful/benign examples, often sourced from red-teaming pools with compositional attacks (e.g., SALMONN-Guard trains on SACRED-Bench comprising 10K audio/text samples and cross-dataset holdout for attack styles (Yang et al., 13 Nov 2025)).
  • Unimodal and Cross-Modal Fusion: Modular encoders project signals (e.g., log-mel spectrograms, patch embeddings) into a joint feature space, with cross-attention and alignment layers optimizing for scenario-dependent gating (cf. Qwen2.5-Omni-7B backbone in SALMONN-Guard, Llama 3.2-Vision backbone for Llama Guard 3 Vision).
  • Zero-shot Guarding via Prompting: QGuard builds a set of grouped guard questions, evaluating each input via concatenated prompt and outputting a risk score with graph-based metrics (PageRank, edge weights) (Lee et al., 14 Jun 2025).
  • Noise Aggregation Defenses: SmoothGuard injects isotropic Gaussian noise into continuous modalities, runs the model N times, and selects the majority cluster (k-means) in embedding space as the robust output (Su et al., 29 Oct 2025); a minimal sketch follows this list.
  • Min–Max Constrained Reinforcement Learning: Beaver-Guard-V and Safe RLHF-V integrate a dual-objective RLHF framework, maximizing helpfulness subject to explicit safety penalties, formalized using a Lagrangian min–max optimization (Ji et al., 22 Mar 2025).
  • Role-play and Rationale-aware Prompting: GUARD orchestrates role-play agents to iteratively manipulate and evaluate jailbreak scenarios, while RapGuard auto-generates multimodal rationales and adapts defensive safety prompts per input (Jiang et al., 2024, Jin et al., 2024).
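A minimal sketch of the noise-aggregation defense described above, assuming generic model and embed callables (the base MLLM and a response-embedding function); the sample count, noise scale, and two-cluster k-means are illustrative defaults, not SmoothGuard's published settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def smoothed_inference(model, embed, image: np.ndarray, prompt: str,
                       n_samples: int = 10, sigma: float = 0.1, n_clusters: int = 2):
    """Randomized-smoothing style inference: perturb the continuous input,
    run the model several times, and return a response from the majority
    cluster of output embeddings."""
    responses, embeddings = [], []
    for _ in range(n_samples):
        noisy = image + np.random.normal(0.0, sigma, size=image.shape)
        responses.append(model(noisy, prompt))        # one forward pass per noise draw
        embeddings.append(embed(responses[-1]))       # e.g., a sentence-embedding vector
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(np.stack(embeddings))
    majority = np.bincount(labels).argmax()           # majority cluster = robust consensus
    idx = int(np.flatnonzero(labels == majority)[0])
    return responses[idx]
```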

Optimization objectives typically combine cross-entropy for harmful/safe classification, regularization (e.g., ℓ2 norm), and custom terms targeting scenario-specific safety decisions.
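Schematically, the constrained objective behind the min–max RLHF approaches above can be written as follows, where the helpfulness reward R_help, harm cost C_harm, and budget d are generic placeholders rather than the papers' exact notation:

```latex
\max_{\theta}\; \mathbb{E}_{(x,y)\sim\pi_{\theta}}\big[R_{\mathrm{help}}(x,y)\big]
\quad \text{s.t.} \quad
\mathbb{E}_{(x,y)\sim\pi_{\theta}}\big[C_{\mathrm{harm}}(x,y)\big] \le d,
```

whose Lagrangian relaxation yields the min–max problem solved during training:

```latex
\min_{\lambda \ge 0}\, \max_{\theta}\;
\mathbb{E}\big[R_{\mathrm{help}}\big] \;-\; \lambda\,\big(\mathbb{E}\big[C_{\mathrm{harm}}\big] - d\big).
```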

4. Benchmarks, Evaluation Suites, and Metrics

Benchmarking and safety evaluation are central components:

  • SACRED-Bench: Targets speech/audio composition, containing 10K samples with overlap, mixture, and multi-format QA (Yang et al., 13 Nov 2025).
  • MM-SafetyBench: Tests multimodal jailbreak resistance across 13 scenarios and variants, supporting image/text/audio and diverse attack types (Pi et al., 2024, Su et al., 29 Oct 2025).
  • MLLMGuard Bilingual Suite: Provides 2,282 curated bilingual (English/Chinese) image-text prompts, covering five safety dimensions (Privacy, Bias, Toxicity, Truthfulness, Legality), annotated with fine-grained responses, and integrates the GuardRank automated evaluator (Gu et al., 2024).
  • BeaverTails-V: Features dual preference annotation and multi-level harm severity labels, undergirding RLHF-based guard training and ablation (Ji et al., 22 Mar 2025).
  • GUARD Compliance and Jailbreak Diagnostics: Translates high-level guidelines into actionable violation prompts and role-play jailbreak templates, collates violation rates and toxicity metrics across modalities (Jin et al., 28 Aug 2025).

Key metrics include Attack Success Rate (ASR), Defense Success Rate, F1 (classification), accuracy on benign tasks, Harmless Rate (HR), Perfect Answer Rate (PAR), and risk scores derived via graph filtering or chain-of-thought reasoning. Automated classifiers (e.g., GuardRank (Gu et al., 2024)) are frequently used for efficiency and reproducibility.
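For reference, the headline metrics reduce to simple ratios over judged evaluation records; a minimal sketch in which the record fields and the source of the harmfulness judgment (human or GuardRank-style classifier) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    is_attack: bool         # adversarial vs. benign query
    response_harmful: bool  # judged harmful by an evaluator
    response_perfect: bool  # harmless AND fully helpful/correct

def attack_success_rate(records):
    attacks = [r for r in records if r.is_attack]
    return sum(r.response_harmful for r in attacks) / max(len(attacks), 1)

def harmless_rate(records):
    return sum(not r.response_harmful for r in records) / max(len(records), 1)

def perfect_answer_rate(records):
    return sum(r.response_perfect for r in records) / max(len(records), 1)
```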

5. Implementation, Deployment, and Integration Considerations

Integration of MLLMGuard systems spans model-integrated modules, inference-time plug-ins, and front-end wrappers:

  • Inference Overhead: Noise aggregation (SmoothGuard) and multi-round filtering pipelines (Beaver-Guard-V) can induce significant latency (e.g., SmoothGuard requires N=10 forward passes; Beaver-Guard-V incurs up to 5× latency for five filter/regenerate cycles (Su et al., 29 Oct 2025, Ji et al., 22 Mar 2025)).
  • Plug-and-Play Compatibility: Frameworks like MLLM-Protector and UniGuard do not alter base model parameters, instead performing post-processing or input transformation (Pi et al., 2024, Oh et al., 2024); a minimal wrapper sketch follows this list.
  • Hardware Requirements: Model-integrated guards running on large backbones (Llama Guard 3 Vision uses an 11B parameter stack, requiring ~40 GB GPU memory (Chi et al., 2024)) may be unsuitable for real-time, low-resource applications; smaller distillations are suggested.
  • Dynamic Guard Management: QGuard allows continuous update and re-ranking of question banks for emerging threats, with robustness against new attack surface expansion (Lee et al., 14 Jun 2025). RapGuard and adaptive chain-of-thought methods generate scenario-specific prompts at runtime for maximized contextual safety (Jiang et al., 2024).
  • Cross-Language and Modality Generalization: Relatively few systems offer native multilingual or cross-modal coverage; MLLMGuard (Gu et al., 2024) and UniGuard (Oh et al., 2024) demonstrate bilingual coverage and scalability to future modalities.
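The post-processing pattern referenced in the list above amounts to wrapping generation with a harm check and a fallback. A minimal sketch with hypothetical generate, harm_score, and detoxify callables standing in for the base MLLM, a harm classifier, and a response rewriter (none of these are the published APIs):

```python
def guarded_generate(generate, harm_score, detoxify, prompt, image,
                     threshold: float = 0.5, max_retries: int = 2) -> str:
    """Plug-and-play guard: draft a response, score it for harm, and either
    return it, rewrite/regenerate it, or fall back to a refusal."""
    response = generate(prompt, image)
    for _ in range(max_retries):
        if harm_score(prompt, image, response) < threshold:
            return response
        response = detoxify(prompt, image, response)  # rewrite or regenerate the draft
    if harm_score(prompt, image, response) < threshold:
        return response
    return "I can't help with that request."
```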

A plausible implication is that deployment at scale will necessitate aggressive optimization for inference cost and scenario transferability, in addition to architectural flexibility for evolving attack typologies.

6. Current Limitations and Prospects for Future Development

Published systems acknowledge several persistent challenges:

  • Coverage versus Generalization: Systems such as SALMONN-Guard and Llama Guard 3 Vision report diminished efficacy on attacks outside their training distribution or on unseen languages/accent groups (Yang et al., 13 Nov 2025, Chi et al., 2024).
  • Modalities Beyond Images and Text: Most frameworks remain limited to image-text and speech-text; extension to audio/video/3D benchmarks is notably absent (Oh et al., 2024, Gu et al., 2024).
  • Adaptive and Novel Attack Resistance: Guard systems may be bypassed by highly novel attack patterns (e.g., voice-cloning adversarial audio, multi-turn jailbreaks, compositional chain-of-prompts) (Yang et al., 13 Nov 2025, Jin et al., 2024, Gu et al., 2024).
  • Safety–Performance Trade-off: Increases in safety often marginally reduce fluency or accuracy on benign inputs (see UniGuard and MLLM-Protector ablation tables); adaptive guardrail strength remains an open area (Oh et al., 2024, Pi et al., 2024).
  • Automated Evaluation Calibration: Thresholds for detection and refusal scoring (e.g., QGuard risk thresholds, GuardRank four-way scoring) are presently dataset-specific; improving out-of-domain and continual calibration is a prospective direction (Lee et al., 14 Jun 2025, Gu et al., 2024).

Proposed future work includes continual learning from online red-teaming, expansion of scenario-specific rationale templates, integrated chain-of-thought reasoning, augmentation with external safety classifiers, adaptive guardrail tuning via Lagrangian regularization, and extension to emerging modalities and jurisdictions (Gu et al., 2024, Jiang et al., 2024, Chi et al., 2024).

7. Comparative Summary and Practical Impact

MLLMGuard frameworks have produced substantial reductions in attack success rates (e.g., SACRED-Bench: Gemini 2.5 Pro ASR 66.75% to SALMONN-Guard 11.32% (Yang et al., 13 Nov 2025); Beaver-Guard-V: five rounds of filtering yield 40.9% relative reduction in unsafe outputs (Ji et al., 22 Mar 2025); RapGuard: ∼98% harmless rate across 9 threat scenarios (Jiang et al., 2024); SmoothGuard: Qwen2.5-VL-7B ASR 0.0345→0.0183 (Su et al., 29 Oct 2025)). Quantitative results and formal evaluation protocols underpin claims of enhanced multimodal safety, though real-world deployment will demand further advances in cross-domain generalization, evaluation scalability, and utility preservation. The integration of rationale-driven chain-of-thought reasoning, adaptive prompt engineering, and multi-round filtering stands as a technical frontier for robust, scalable MLLM guard architectures.
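The relative reductions implied by these figures can be computed directly; the percentages below are derived from the numbers cited above rather than quoted from the papers:

```python
def relative_reduction(before: float, after: float) -> float:
    return (before - after) / before

print(f"{relative_reduction(66.75, 11.32):.0%}")    # SACRED-Bench ASR: ~83% relative drop
print(f"{relative_reduction(0.0345, 0.0183):.0%}")  # SmoothGuard on Qwen2.5-VL-7B: ~47%
```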
