Papers
Topics
Authors
Recent
2000 character limit reached

AprielGuard: Unified 8B Safeguard Model

Updated 24 December 2025
  • AprielGuard is an 8B-parameter model that employs a unified taxonomy to simultaneously detect safety risks and adversarial attacks in large language models.
  • It utilizes structured reasoning traces and a multi-task fine-tuning framework to enhance interpretability and robust moderation in varied conversational and agentic contexts.
  • Empirical evaluations show competitive F1 scores and significant performance gains over baselines in multi-turn workflows and long-context moderation scenarios.

AprielGuard is an 8-billion-parameter Transformer-based safeguard model designed to address the unified detection and mitigation of safety risks and adversarial manipulations in LLMs. Unlike prior moderation systems that separately treat toxicity, bias, and prompt-based attacks, AprielGuard introduces a single taxonomy and training approach to generalize across both dimensions, enabling robust safeguarding in diverse conversational and agentic settings. The model is trained over a broad spectrum of synthetic and open data, augmented with explicit structured reasoning traces to facilitate interpretability. Empirical evaluations demonstrate competitive performance relative to leading guardrails, especially in complex scenarios involving multi-turn agentic workflows and long-context inputs (Kasundra et al., 23 Dec 2025).

1. Unified Taxonomy of Risks and Threats

AprielGuard formalizes input assessment through a joint categorization scheme, mapping each sample xx to (s,a)S×A(s,a) \in S \times A, where S={safe,unsafe}S = \{\text{safe}, \text{unsafe}\} and A={non_adversarial,adversarial}A = \{\text{non\_adversarial}, \text{adversarial}\}. This yields four high-level classes:

  • safe & non-adversarial
  • unsafe & non-adversarial
  • safe & adversarial
  • unsafe & adversarial

Safety Risks Taxonomy

Drawing on the 16-category SALAD Data taxonomy [Li et al. 2024], AprielGuard labels unsafe content using a hierarchical framework. The top-level categories include "Toxic Content," "Unfair Representation," and "Violation of Personal Property," with further refinement into second and third-tier harms (e.g., Level 1 “Toxicity Harms” → Level 2 “Toxic content” → Level 3 “Hate speech”, “Insult”, etc.).

Adversarial Attacks Taxonomy

The model addresses a spectrum of adversarial strategies, including code encodings (Base64, ROT13, SQL obfuscation), prompt injections (“Ignore previous instructions”, sandwich attacks), stylizing (leetspeak, punctuation substitutions), rhetorical manipulation (reverse psychology), role-playing (“DAN”, persona-based prompts), and meta-prompting (delayed attacks, perspective shifts).

Formally, the outputs are multi-label vectors:

y(s){0,1}16(safety categories),y(a){0,1}n(adversarial strategies)y^{(s)} \in \{0,1\}^{16} \quad\text{(safety categories)},\quad y^{(a)} \in \{0,1\}^{n} \quad\text{(adversarial strategies)}

2. Model Architecture and Capacity

AprielGuard is implemented as a single 8B-parameter encoder–decoder Transformer, structured for chat moderation tasks and derived from a downscaled variant of the Apriel-1.5–15B-Thinker backbone. Content isolation is managed with custom <|content|>…<|end|> tags, ensuring the separation between moderation instructions and user inputs. The forward pass operates as:

h0=embedding(w),h+1=TransformerLayer(h),P=softmax(WohL+bo)h^0 = \text{embedding}(w), \quad h^{\ell+1} = \text{TransformerLayer}_\ell(h^\ell), \quad P = \text{softmax}(W_o h^L + b_o)

where PP is the moderation output distribution. No further architectural specifics (e.g., attention heads, layer count) are provided beyond core Transformer conventions.

3. Learning Framework and Objective Functions

The supervised fine-tuning framework utilizes labeled moderation datasets for joint prediction of safety and adversarial labels. The objective is a multi-task sum of cross-entropy losses:

L(θ)=LCE(y(s),p(s))+LCE(y(a),p(a)),LCE(y,p)=iyilogpiL(\theta) = L_{CE}(y^{(s)}, p^{(s)}) + L_{CE}(y^{(a)}, p^{(a)}), \quad L_{CE}(y,p) = -\sum_i y_i \log p_i

Training proceeds for three epochs with a 2×1042 \times 10^{-4} learning rate, batch size 1 (with accumulation across 8 steps), using the Adam optimizer (β1=0.9\beta_1 = 0.9, β2=0.999\beta_2 = 0.999). No auxiliary regularizers or loss components are disclosed.

4. Training Data Composition

AprielGuard is trained on approximately 250,000 synthetic samples, stratified across standalone prompts, multi-turn conversations, and agentic workflows. The joint safe–unsafe and adversarial–nonadversarial splits are detailed in Table 1 of the source:

Label Conversations Standalone
Safe & non-adversarial 100,736 74,706
Safe & adversarial 13,482 7,594
Unsafe & non-adversarial 18,980 40,244
Unsafe & adversarial 26,901 45,071

Data is synthesized via Mixtral-8x7B and uncensored Llama-8B with rule-based and LLM adversarial generation. Filtering leverages semantic embeddings (threshold 0.7), ROUGE-L (0.9), and LLM filtering for removal of ineffective ("refusal") samples. Augmentation techniques include character-level noise, leetspeak, and paraphrasing/reordering. All data is synthetic, with representation for both open benchmarks and proprietary agentic/jailbreak scenarios.

5. Structured Reasoning Traces for Interpretability

Interpretability is enhanced via reasoning traces embedded directly in training and inference workflows. The annotation pipeline assigns moderation labels, selects one of eight prompt templates, and incorporates reasoning breakdowns within <reasoning>…</reasoning> tags, alongside final moderation decisions in <result> tags. Formatting and label alignment are validated to ensure consistency.

Deployment supports two operational modes:

  • Without reasoning: outputs basic safety and adversarial tags
  • With reasoning: provides stepwise <reasoning> annotations with explicit verdicts

At inference, the moderation pseudocode (as specified in prompts) directs assessment based on conversational turn:

1
2
3
4
5
6
if conversation ends with assistant turn:
    assess safety of last assistant response, adversarial of last user turn
else if ends with user turn:
    assess both safety and adversarial of last user message
else:
    no-op

This dual-mode output facilitates both concise moderation and interpretability for downstream analysis.

6. Empirical Evaluation

AprielGuard is evaluated across multiple public and proprietary benchmarks:

Public Safety-Risks Benchmarks

On 44,699 examples from nine English datasets, the model achieves (w/o reasoning): Precision 0.87, Recall 0.89, F1 0.88, FPR 0.11; with reasoning, F1 decreases marginally to 0.87.

Public Adversarial-Attack Benchmarks

For 18,073 adversarial samples, results are: Precision 0.94, Recall 0.92, F1 0.93, FPR 0.11 (no reasoning); with reasoning, F1 0.92.

Comparative Performance

Model Safety Risks (F1 / FPR) Adversarial (F1 / FPR)
AprielGuard-8B (w/o reasoning) 0.88 / 0.11 0.93 / 0.11
IBM-Granite-3.3-8B 0.87 / 0.15 0.87 / 0.12
Llama-Guard-3-8B 0.79 / 0.05 0.53 / 0.04

This suggests AprielGuard offers meaningful gains in F1 scores over open-source baselines, especially in adversarial contexts.

Agentic Workflow and Long-Context Scenarios

On in-house multi-step agentic jailbreaking data (\approx4,300 samples), AprielGuard records Safety F1 0.86 (FPR 0.02), Adversarial F1 0.95 (FPR 0.01), outperforming baselines by >0.40\gt 0.40 F1 on adversarial detection. For long-context moderation (up to 32K tokens, 282 examples), F1 approaches 0.97 for safety detection.

Multilingual Evaluation

Benchmarks translated via MADLAD-400-3B-MT into eight languages maintain F1 within 2–4 points of English performance, including French, German, Spanish, Portuguese, Italian, Dutch, Japanese (minor drop), and French-Canadian.

7. Open-Sourcing and Reproducibility

AprielGuard’s developers plan a comprehensive release package including:

  • AprielGuard-8B model weights under a permissive license
  • Full training and evaluation code (data-generation, filtering/augmentation scripts)
  • Entire benchmark suite spanning public chats, proprietary agentic datasets, long-context, and multilingual sets

The explicit goal is to support transparent audits, facilitate research into reasoning-based interpretability, and enable rigorous cross-model comparisons utilizing the unified taxonomy. A plausible implication is greater reproducibility and extensibility for research into LLM robustness and moderation frameworks (Kasundra et al., 23 Dec 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)
1.
AprielGuard (2025)

Whiteboard

Topic to Video (Beta)

Follow Topic

Get notified by email when new papers are published related to AprielGuard.