
Llama-Guard-3: Efficient LLM Safeguards

Updated 25 January 2026
  • Llama-Guard-3 is a family of LLM-based safeguards designed to filter harmful content using robust input/output mechanisms and multilingual coverage.
  • It employs advanced pruning and quantization techniques to achieve significant model compression while retaining high moderation accuracy.
  • The system enables real-time, on-device moderation with a dual-layer safety pipeline and demonstrates resilience against adversarial attacks across modalities.

Llama-Guard-3 is a family of LLM-based safeguards designed to moderate and filter potentially harmful content in human–AI dialogues. Developed in conjunction with Meta's Llama 3 herd of models (Grattafiori et al., 2024), Llama-Guard-3 encompasses a suite of architectures scaling from compact, quantized models suitable for on-device deployment to full-parameter models supporting high-resource moderation tasks. Key design goals include robust input/output filtering, efficient runtime performance, multilingual and modality-aware coverage, and resistance to adversarial bypass techniques (Fedorov et al., 2024; Chi et al., 2024; Krasnodębska et al., 2025). The system-level guardrail approach prioritizes accurate hazard categorization while minimizing latency and memory footprint.

1. Model Architectures and Compression Techniques

Llama-Guard-3 models begin with Llama 3.2 pre-trained decoder-only transformer architectures, ranging from 1B to 8B parameters. The flagship compact variant, Llama-Guard-3-1B-INT4 (“LG INT4”, Editor's term) (Fedorov et al., 2024), utilizes extensive network pruning and quantization for maximal efficiency:

  • Block Pruning: Removes redundant residual-update blocks by scoring layer importance with the input–output cosine similarity $E_{\mathcal{D}}\!\left[\frac{\langle x_{\mathrm{in}}, x_{\mathrm{out}} \rangle}{\|x_{\mathrm{in}}\|\,\|x_{\mathrm{out}}\|}\right]$; blocks whose output barely differs from their input are removed, reducing depth from 16 to 12 layers.
  • MLP Neuron Pruning: Narrows the hidden MLP width from 8,192 to 6,400 units, scoring each neuron by its mean squared activation $E_{\mathcal{D}}[h_k^2]$, $k=1,\ldots,K$.
  • Unembedding Pruning: Restricts the output projection to 20 fixed tokens (e.g., “safe,” “unsafe,” and 14 hazard-category labels), shrinking it from $2{,}048 \times 128\mathrm{k}$ parameters to $2{,}048 \times 20$.
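The two pruning criteria above can be sketched in NumPy. The function names (`block_importance`, `neuron_scores`) are illustrative, and the random arrays stand in for activations collected on a real calibration set:

```python
import numpy as np

def block_importance(x_in, x_out):
    """Mean cosine similarity between a block's input and output activations.

    x_in, x_out: (num_samples, hidden_dim) arrays collected on a calibration
    set D. A similarity near 1 means the residual block barely changes its
    input, making it a candidate for removal.
    """
    dots = np.sum(x_in * x_out, axis=1)
    norms = np.linalg.norm(x_in, axis=1) * np.linalg.norm(x_out, axis=1)
    return float(np.mean(dots / norms))

def neuron_scores(h):
    """Per-neuron importance E_D[h_k^2] for MLP width pruning.

    h: (num_samples, mlp_width) hidden activations; the lowest-scoring
    neurons are dropped to narrow the MLP from 8,192 to 6,400 units.
    """
    return np.mean(h ** 2, axis=0)

# Example: keep the 6,400 highest-scoring of 8,192 neurons.
rng = np.random.default_rng(0)
h = rng.normal(size=(128, 8192))
keep = np.argsort(neuron_scores(h))[-6400:]
```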

After structural pruning, quantization-aware training (QAT) compresses all weights to 4 bits via $Q(\theta_g) = s_{\theta}\,\mathrm{clip}(\mathrm{round}(\theta_g/s_{\theta}), -8, 7)$ with per-group scale $s_{\theta} = \frac{1}{7.5}\max_{i\in g}|\theta_{g,i}|$, and activations to 8 bits per token via $Q(x_{\mathrm{in}}) = s_x\,\mathrm{clip}(\mathrm{round}((x_{\mathrm{in}}-z)/s_x), 0, 255) + z$, where $s_x = \frac{\max x_{\mathrm{in}} - \min x_{\mathrm{in}}}{255}$ and $z = \min x_{\mathrm{in}}$. Weight rounding of the embedding layer (group size 32) further reduces the model size from ∼2.1 GB (bf16) to 0.5 GB, and unembedding pruning brings it to 440 MB. A distillation step using a larger Llama-Guard-3-8B teacher corrects the resulting accuracy degradation (Fedorov et al., 2024).
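A minimal NumPy sketch of the two fake-quantizers, implementing the formulas above directly (the function names are mine; in real QAT these operations run inside the training graph so gradients can flow through them):

```python
import numpy as np

def quantize_weights_int4(theta_g):
    """Symmetric 4-bit fake-quantization of one weight group:
    Q(theta) = s * clip(round(theta / s), -8, 7), s = max|theta| / 7.5."""
    s = np.max(np.abs(theta_g)) / 7.5
    q = np.clip(np.round(theta_g / s), -8, 7)
    return s * q  # dequantized values used in the QAT forward pass

def quantize_activations_int8(x):
    """Asymmetric per-token 8-bit fake-quantization:
    Q(x) = s * clip(round((x - z) / s), 0, 255) + z,
    with s = (max - min) / 255 and zero point z = min."""
    z = np.min(x)
    s = (np.max(x) - z) / 255.0
    q = np.clip(np.round((x - z) / s), 0, 255)
    return s * q + z

# Demo: quantized weights land on a 16-level grid; activations stay
# within half a quantization step of the originals.
rng = np.random.default_rng(1)
w_q = quantize_weights_int4(rng.normal(size=(64,)))
```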

The full-scale Llama-Guard-3-8B model (Krasnodębska et al., 2025; Grattafiori et al., 2024) integrates similar transformer backbones but retains higher capacity and flexible hazard-category output spaces for applications prioritizing maximal moderation quality.

2. Safety Moderation Pipeline and Taxonomy

Llama-Guard-3 guards are deployed as a dual-layer fail-safe atop Llama 3 (and other LLMs). The typical pipeline introduces system-level input and output filters, both realized as safety classifiers:

  • Input Guard: $s_{\mathrm{in}}(x)$ estimates the safety risk of the user input $x$; the input is accepted iff $\mathrm{safe}_{\mathrm{in}}(x) \equiv [\,s_{\mathrm{in}}(x) < \tau_{\mathrm{in}}\,]$.
  • Output Guard: $s_{\mathrm{out}}(x, y)$ estimates the safety risk of the response $y$ given input $x$; the response is released iff $\mathrm{safe}_{\mathrm{out}}(x, y) \equiv [\,s_{\mathrm{out}}(x, y) < \tau_{\mathrm{out}}\,]$.
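The dual-layer pipeline can be sketched as follows, assuming each guard exposes a scalar risk score in $[0, 1]$. The scorer lambdas, thresholds, and refusal strings below are toy stand-ins, not the production interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DualLayerGuard:
    """Sketch of the dual-layer fail-safe: an input guard before generation
    and an output guard after it (scorer interface is hypothetical)."""
    score_input: Callable[[str], float]        # s_in(x)
    score_output: Callable[[str, str], float]  # s_out(x, y)
    tau_in: float = 0.5
    tau_out: float = 0.5

    def moderate(self, generate: Callable[[str], str], x: str) -> str:
        # Input guard: proceed only if s_in(x) < tau_in.
        if not (self.score_input(x) < self.tau_in):
            return "[refused: unsafe input]"
        y = generate(x)
        # Output guard: release the response only if s_out(x, y) < tau_out.
        if not (self.score_output(x, y) < self.tau_out):
            return "[refused: unsafe output]"
        return y

# Toy usage with keyword-based stand-in scorers.
guard = DualLayerGuard(
    score_input=lambda x: 1.0 if "attack" in x else 0.0,
    score_output=lambda x, y: 1.0 if "attack" in y else 0.0,
)
reply = guard.moderate(lambda x: "echo: " + x, "hello")
```

The point of the two layers is independence: a harmful completion is caught by the output guard even when the prompt itself looked benign to the input guard.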

For both guards, classification is supervised over hazardous, borderline, and synthetic (“Rainbow Teaming”) prompts with a cross-entropy loss plus optional regularization. Taxonomy coverage spans up to 14 language-based harm categories, including Hate, Defamation, Self-Harm, Child Exploitation, Specialized Advice, Intellectual Property, Indiscriminate Weapons, and Elections (Grattafiori et al., 2024; Krasnodębska et al., 2025; Chi et al., 2024).

Empirical evaluation is performed using Violation Rate (VR; the fraction of adversarial prompts eliciting unsafe output) and False Refusal Rate (FRR; the fraction of innocuous prompts incorrectly refused), along with F1, precision, recall, and accuracy (Fedorov et al., 2024; Grattafiori et al., 2024; Krasnodębska et al., 2025).
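These metrics reduce to simple counting over labeled outcomes. A minimal sketch (helper names are mine):

```python
def violation_rate(unsafe_flags):
    """VR: fraction of adversarial prompts that elicited unsafe output.
    unsafe_flags: booleans, True = the attack succeeded."""
    return sum(unsafe_flags) / len(unsafe_flags)

def false_refusal_rate(refusal_flags):
    """FRR: fraction of innocuous prompts that were incorrectly refused."""
    return sum(refusal_flags) / len(refusal_flags)

def f1_score(tp, fp, fn):
    """F1 over the 'unsafe' class from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 1 of 4 adversarial prompts slips through (VR = 0.25);
# 1 of 5 benign prompts is refused (FRR = 0.2).
vr = violation_rate([True, False, False, False])
frr = false_refusal_rate([True, False, False, False, False])
```

VR and FRR pull in opposite directions: lowering the decision thresholds reduces VR but raises FRR, which is why both are reported together.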

3. Runtime Performance, Quantization, and Deployment

LG INT4 is optimized for on-device inference using the ExecuTorch runtime with the XNNPACK ARM-CPU delegate (Fedorov et al., 2024). It achieves:

Model           Size      Throughput (tokens/s)   TTFT (s)
LG 3-1B (bf16)  ~2.1 GB   <10                     >5
LG 3-1B-INT4    440 MB    ≥30                     ≤2.5

Quantization and pruning yield a ≈7× reduction in size, with 4-bit matrix multiplies and 8-bit activations accelerating computation. These techniques enable mobile deployment without accelerators and with minimal DRAM consumption. The primary trade-off is reduced output vocabulary flexibility, fixed to 20 tokens (Fedorov et al., 2024).

4. Empirical Safety Efficacy Across Languages and Modalities

On the MLCommons AI Safety v0.5 taxonomy, LG INT4 matches or exceeds the moderation metrics of its full-precision sibling (Fedorov et al., 2024). For English, LG INT4 achieves F1=0.904 and FPR=0.084 versus LG 3-1B’s F1=0.899 and FPR=0.090. In non-English languages it maintains comparable efficacy, with only small degradations, e.g., in French (F1=0.873; FPR=0.072).

Llama-Guard-3-8B fine-tuned for Polish (Krasnodębska et al., 2025) achieves binary F1=0.889 on clean data and 0.782 under adversarial perturbations, with multiclass F1=0.563 and 0.507, respectively. Robustness, measured as the drop in F1 under character-level, diacritic-stripping, and OCR-style noise, rivals other LLM-based guards but is usually surpassed by smaller, BERT-derived classifiers.

Multimodal protection is provided by Llama Guard 3 Vision (Chi et al., 2024), built on Llama 3.2-Vision (11B). It integrates a lightweight vision encoder that maps image patches to visual embeddings, which are processed jointly with text tokens. On the prompt-classification task it reaches Precision=0.891, Recall=0.623, F1=0.733, and FPR=0.052, outperforming GPT-4o, especially on false-positive rate. Category-level F1 typically exceeds 0.8 across the MLCommons taxonomy.

5. Robustness to Adversarial Attacks and Defense Strategies

Systematic jailbreaks, such as universal adversarial prefix injection, can bypass Llama-Guard-3 via a two-stage prefix-based attack (PRP) (Mangaokar et al., 2024). The adversary constructs an embedding-optimized prefix $\delta$ (20–40 tokens) and leverages in-context examples to force its inclusion at the start of generated outputs. When the prefix is prepended, the guard’s classifier misclassifies harmful content as safe.

PRP yields an absolute attack success rate (ASR) against Llama-Guard-3 of ≈13% at baseline and ≈16% once the prefix is prepended (Mangaokar et al., 2024). Defense strategies include adversarial retraining with prefix-perturbed examples, input sanitization (removing or masking repetitive spans), randomized prefix injection, and ensemble guarding.
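One of the listed defenses, input sanitization against repetitive spans, can be sketched as a simple heuristic. This check is my own illustration, not a method from the cited papers; it exploits the fact that a prefix-injection payload typically repeats a fixed adversarial span verbatim across in-context examples:

```python
def has_repeated_span(text: str, min_len: int = 12) -> bool:
    """Flag inputs containing any verbatim-repeated span of at least
    `min_len` characters (illustrative sanitization heuristic; a real
    sanitizer would operate on tokens and use fuzzier matching)."""
    seen = set()
    for i in range(len(text) - min_len + 1):
        span = text[i:i + min_len]
        if span in seen:
            return True  # candidate injected span: mask it or refuse
        seen.add(span)
    return False
```

Flagged inputs could then be refused outright or have the repeated span masked before the guard scores them; the sliding-window set lookup keeps the check linear in input length.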

For image-based and multimodal inputs, Llama Guard 3 Vision has been stress-tested against PGD image attacks (on prompt classification, the rate of misclassifying harmful content as safe rises from 21% on clean inputs to 70% at $\epsilon = 8$) and GCG text-suffix attacks (from 4% clean to 72% under attack). Recommendations include deploying both prompt and response classifiers, adversarial training, and complementary filters (e.g., perplexity detectors) (Chi et al., 2024).

6. Applications, Limitations, and Future Directions

Llama-Guard-3 variants—especially INT4 and Vision—enable embedding safety moderation directly within mobile and edge devices, facilitating real-time conversational safeguards without dependence on cloud infrastructure. Multimodal variants extend text-only moderation to image-understanding contexts (e.g., social media, customer support) (Chi et al., 2024).

Limitations include fixed output vocabulary in compact models, sensitivity to language/domain shifts, and partial vulnerability to adversarial prefix attacks. Future work may involve deeper quantization (e.g., 2 bits), hardware-aware neural architecture search, dynamic output space adaptation, enhanced robustness via adversarial and differential privacy training, as well as multi-image and multi-language modality support (Fedorov et al., 2024, Grattafiori et al., 2024, Chi et al., 2024).

Open-sourcing Llama-Guard-3 derivatives facilitates research and deployment of efficient, on-device conversational guardrails, with ongoing improvements targeting finer-grained policy categories, real-time low-latency filtering, and certified adversarial robustness in the face of increasingly sophisticated attack methods.
