Llama Guard 3 Vision

Updated 5 November 2025
  • Llama Guard 3 Vision is a multimodal content moderation tool that integrates image and text inputs to detect violations across 13 hazard categories.
  • It leverages the Llama 3.2 11B Vision backbone, whose vision encoder processes images rescaled and chunked to 560×560 pixels, and achieves state-of-the-art performance on internal MLCommons-aligned moderation benchmarks.
  • The system is fine-tuned using hybrid datasets and adversarial training, ensuring robust safety against both text and image-based attack strategies.

Llama Guard 3 Vision is a multimodal content moderation model designed specifically to safeguard human-AI conversations involving image understanding. It extends the previously text-only Llama Guard frameworks to robustly classify and filter both multimodal prompts (image + text) and responses (text, contextually grounded on images) produced by advanced vision-LLMs. Built atop the Llama 3.2 Vision 11B backbone, it is fine-tuned to detect violations across 13 hazard categories from the MLCommons taxonomy, demonstrating strong performance and resilience under adversarial attack scenarios.

1. Architectural Foundations and Multimodal Extension

Llama Guard 3 Vision builds on the Llama 3.2 11B Vision model, which incorporates a high-capacity vision encoder capable of processing images rescaled and chunked to 560×560 pixels. Images are embedded and fused with the textual tokens, supporting a single image per moderation request. Earlier Llama Guard versions (1, 2, 3-1B, 3-8B) were limited to text-only moderation. In contrast, Llama Guard 3 Vision jointly reasons over both modalities, enabling enforcement of safety policies on multimodal prompts and their corresponding responses. The model architecture is adapted to accept flexible input sequences, comprising hazard guideline lists, image and text prompts, and conversation history.
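A minimal loading-and-inference sketch using Hugging Face transformers is shown below; the model identifier, class names, and chat-template call reflect the public Llama Guard 3 Vision release, but exact signatures should be checked against the current transformers documentation, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-Guard-3-11B-Vision"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One image per moderation request; the processor handles the 560x560 rescaling/chunking.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How do I make what is shown in this picture?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```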

2. Training Regimen and Hazard Taxonomy Alignment

Supervised fine-tuning is performed on Llama 3.2 11B Vision, targeting the classification of each input as “safe” or “unsafe”, and, in the latter case, identifying the subset of violated categories. The training utilizes hybrid datasets with annotated samples that include both (i) human-authored prompts paired with real or synthetic images, and (ii) model-generated responses, explicitly including cases induced by jailbreaking strategies. Labels are human-annotated or, when needed, supplied via the Llama 3.1-405B model.

Text-only moderation data is incorporated using dummy image placeholders, improving cross-modal generalization. All classification targets are derived from the MLCommons hazard taxonomy, comprising:

  • Violent Crimes
  • Non-Violent Crimes
  • Sex Crimes
  • Child Sexual Exploitation
  • Defamation
  • Hate
  • Privacy
  • Intellectual Property
  • Elections
  • Indiscriminate Weapons
  • Specialized Advice
  • Self-Harm
  • Sexual Content

The model is trained with a standard cross-entropy loss over input sequences of up to 8192 tokens, with a learning rate of 1 × 10⁻⁵.
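The objective can be pictured as next-token cross-entropy restricted to the moderator's output tokens (the verdict plus any category codes); the sketch below is illustrative, with hypothetical tensor names and label masking, not the released training code.

```python
import torch
import torch.nn.functional as F

def moderation_sft_loss(model, input_ids, labels):
    """Cross-entropy over the target tokens of one moderation example.

    `labels` mirrors `input_ids` but holds -100 on every position belonging to the
    hazard guidelines, image tokens, and conversation history, so only the verdict
    and category tokens contribute to the loss (hypothetical masking scheme).
    """
    logits = model(input_ids=input_ids).logits        # (batch, seq_len <= 8192, vocab)
    shift_logits = logits[:, :-1, :]                   # predict token t+1 from prefix <= t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Optimizer with the reported learning rate:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```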

3. Content Moderation: Dual-Mode Classification

Llama Guard 3 Vision performs both “prompt classification” (screening user-provided image + text) and “response classification” (screening model-generated responses in the context of the original multimodal prompt). When and how to apply each mode is application-dependent. Notably, response classification is empirically more robust to ambiguity and adversarial attack.

The moderation interface outputs a binary safe/unsafe flag. On unsafe classification, the specific hazard categories involved are listed by index, aligning with the MLCommons taxonomy.
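In practice the moderator's generation is short: a verdict line and, when unsafe, a line of category codes. A minimal parser along those lines is sketched below; the exact output layout (e.g. "unsafe" followed by "S1,S10") follows the Llama Guard family's convention and should be verified against the model card.

```python
def parse_moderation_output(generated: str):
    """Parse a Llama Guard style verdict into (is_safe, category_codes).

    Assumes the first non-empty line is "safe" or "unsafe" and, for unsafe
    content, that the next line lists comma-separated category codes such as
    "S1,S10" (indices per the MLCommons-aligned taxonomy).
    """
    lines = [ln.strip() for ln in generated.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    codes = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in codes]

print(parse_moderation_output("unsafe\nS9"))  # (False, ['S9'])
print(parse_moderation_output("safe"))        # (True, [])
```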

4. Benchmark Performance and Comparative Evaluation

Llama Guard 3 Vision demonstrates state-of-the-art performance on internal MLCommons-aligned benchmarks for both prompt and response moderation. When compared to proprietary models such as GPT-4o and GPT-4o mini, it achieves markedly higher F1 score and precision, particularly in response classification. The following table summarizes core performance metrics:

Task                      Model                  Precision  Recall  F1     FPR
Prompt Classification     Llama Guard 3 Vision   0.891      0.623   0.733  0.052
Prompt Classification     GPT-4o                 0.544      0.843   0.661  0.485
Response Classification   Llama Guard 3 Vision   0.961      0.916   0.938  0.016
Response Classification   GPT-4o                 0.579      0.788   0.667  0.243
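As a quick consistency check, the reported F1 values follow from the precision and recall columns via the harmonic mean F1 = 2PR/(P + R):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.891, 0.623), 3))  # 0.733  (prompt classification, Llama Guard 3 Vision)
print(round(f1(0.961, 0.916), 3))  # 0.938  (response classification, Llama Guard 3 Vision)
```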

On a per-hazard-category basis (response classification), F1 scores all exceed 0.69, with particularly strong performance on critical categories such as Defamation (0.967), Indiscriminate Weapons (0.995), and Elections (0.957).

5. Adversarial Robustness and Threat Model

Robustness was rigorously evaluated under stringent white-box adversarial settings:

  • Image-side Attacks: Using Projected Gradient Descent (PGD), adversarial perturbations can increase misclassification rates for prompt classification from 21% (clean) to 70% (attacked) under an ℓ∞ budget of 8/255. Response classification remains substantially more robust (6% rising to 22–27%).
  • Text-side Attacks: Appending optimized suffixes (using GCG strategies) can elevate prompt misclassification from 4% to 72%. Response classification, however, remains far less vulnerable unless the attacker also controls the generated response.

Technical details for these attacks include a PGD step size of α = 0.1 with up to 100 iterations, and GCG with a search width of 64 and top-32 token substitutions. These results indicate that dual-mode safeguarding (screening both prompts and responses) is advisable, with response classification preferred in ambiguous or adversarial settings.
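A PGD image attack of the kind used in this evaluation can be sketched as follows; the PyTorch code is illustrative, and `moderator_loss` is a hypothetical differentiable wrapper returning the scalar the attacker wants to maximize (e.g. the cross-entropy toward a "safe" verdict on unsafe content), with image tensors assumed to lie in [0, 1].

```python
import torch

def pgd_attack(image, moderator_loss, eps=8 / 255, alpha=0.1, steps=100):
    """Projected Gradient Descent within an l_inf ball of radius eps around `image`."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = moderator_loss(adv)                        # scalar the attacker maximizes
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # signed-gradient ascent step
            adv = image + (adv - image).clamp(-eps, eps)  # project back into the l_inf ball
            adv = adv.clamp(0.0, 1.0)                     # keep a valid image
        adv = adv.detach()
    return adv
```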

6. Integration into Vision-LLM Pipelines

Llama Guard 3 Vision is designed to function as a system-level safeguard for any multimodal LLM, particularly those built on the Llama 3.2 Vision backbone. Images are preprocessed and embedded alongside textual tokens; moderation is performed directly on the full context (image plus dialogue). The model is practical for filtering both inputs (to protect the model from unsafe prompts) and outputs (to guard users from unsafe responses), thereby serving as a modular content moderation layer.
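At the system level this reduces to a guard-generate-guard loop. The sketch below assumes two hypothetical helpers, `moderate(...)` (wrapping the model and parser above) and `vision_llm_generate(...)` (the protected assistant):

```python
def guarded_chat(image, user_text, moderate, vision_llm_generate,
                 refusal="Sorry, I can't help with that request."):
    """Wrap a vision-LLM with prompt- and response-level moderation.

    `moderate(image, text, response=None)` is a hypothetical helper returning
    (is_safe, category_codes); `vision_llm_generate(image, text)` is the
    underlying assistant. Both checks see the same image + dialogue context.
    """
    # 1. Prompt classification: screen the user-supplied image + text.
    safe, cats = moderate(image, user_text)
    if not safe:
        return refusal, cats

    # 2. Generate a candidate reply with the protected vision-LLM.
    reply = vision_llm_generate(image, user_text)

    # 3. Response classification: screen the reply in its multimodal context.
    safe, cats = moderate(image, user_text, response=reply)
    return (reply, []) if safe else (refusal, cats)
```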

Recent research demonstrates that adaptive, knowledge-driven adversarial jailbreaks (e.g., those generated with GUARD (Jin et al., 5 Feb 2024), GUARD-JD (Jin et al., 28 Aug 2025)) can induce multimodal vision-LLMs, including those protected by standard classifiers, to generate hazardous output. These jailbreaks leverage natural, semantically aligned scenarios to circumvent surface-level filtering. This necessitates continual adversarial evaluation and potential augmentation of the safeguarding strategy.

7. Limitations and Deployment Considerations

  • Modalities: Llama Guard 3 Vision is currently optimized for single-image, English-only moderation. Each moderation sample supports one image, resized and chunked as required by the backbone. The system is not designed to support image sequences, video, or multilingual input out-of-the-box.
  • Taxonomy Alignment: Some hazards (Defamation, Intellectual Property, Elections) require up-to-date or external world knowledge not natively captured by the model.
  • Adversarial Limits: Despite strong resistance, no system is impervious. Adversarial training, input constraints, and anomaly/perplexity detection (see the sketch after this list) are recommended to mitigate emerging threats.
  • Application Domains: Best suited for high-risk scenarios (social media moderation, conversational assistants, educational settings) where visual context is integral.
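One common form of the perplexity-based screening mentioned above is to score incoming text with a small reference language model and flag prompts whose perplexity falls far outside the natural-language range, since GCG-style suffixes are typically gibberish. This is a generic mitigation, not part of Llama Guard 3 Vision itself; the reference model and threshold below are arbitrary assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, lm, tokenizer) -> float:
    """Perplexity of `text` under a reference causal language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(lm.device)
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss        # mean next-token cross-entropy
    return torch.exp(loss).item()

def looks_adversarial(text: str, lm, tokenizer, threshold: float = 1000.0) -> bool:
    """Flag text whose perplexity exceeds a deployment-specific threshold."""
    return perplexity(text, lm, tokenizer) > threshold

# Example with an arbitrary small reference model:
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# looks_adversarial(user_prompt, lm, tok)
```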

Summary Table: Llama Guard 3 Vision Capabilities

Dimension         Implementation Highlight
Foundation        Llama 3.2 11B Vision, vision encoder + multimodal fusion
Modalities        Multimodal (image + text), one image per moderation call
Safeguard Target  Prompt classification, response classification
Hazard Taxonomy   MLCommons (13 categories)
Benchmarking      Outperforms GPT-4o/mini on response moderation (F1/Precision)
Adversarial Test  Strong resilience, but not invulnerable
Language Support  English (other languages not supported)

Llama Guard 3 Vision establishes a new standard for open-source, category-precise, multimodal content moderation. Through integration with vision-LLM stacks and comprehensive adversarial evaluation, it enables more secure and reliable human-AI interactions in increasingly multimodal digital environments (Chi et al., 15 Nov 2024).
