Roblox Guard 1.0: Taxonomy-Adaptive Moderation
- Roblox Guard 1.0 is a moderation system that adapts to diverse safety taxonomies, built on the 8B-parameter Llama-3.1-8B-Instruct backbone with LoRA fine-tuning.
- It leverages chain-of-thought rationales and extensive synthetic and human safety datasets to enhance input-output moderation robustness.
- Benchmarking via RobloxGuard-Eval shows state-of-the-art F1 scores for prompt and response moderation across 23 distinct safety categories.
Roblox Guard 1.0 is a taxonomy-adaptive moderation model designed to provide robust input and output guardrails for LLM systems. Developed on the Llama-3.1-8B-Instruct backbone and instruction fine-tuned to generalize across diverse, unseen safety taxonomies, it delivers comprehensive moderation by leveraging chain-of-thought (CoT) rationales, extensive human and synthetic safety corpora, and input inversion for robust and adaptive response behaviors. Roblox Guard 1.0 introduces RobloxGuard-Eval, a benchmark curated for extensible, fine-grained evaluation of LLM moderation capabilities (Nandwana et al., 5 Dec 2025).
1. Model Architecture and Adaptation
The foundation of Roblox Guard 1.0 is the Llama-3.1-8B-Instruct Transformer. This 8-billion-parameter architecture employs standard multi-head self-attention and feed-forward networks. Formally, attention at each layer follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V,$$

where $Q, K, V \in \mathbb{R}^{n \times d_k}$ are the projected query, key, and value matrices, $n$ is the sequence length, and $d_k$ is the dimensionality per head. The feed-forward sublayer is structured as:

$$\mathrm{FFN}(x) = W_2\, \sigma(W_1 x + b_1) + b_2.$$
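For concreteness, a minimal NumPy sketch of a single attention head implementing the formula above (toy shapes and variable names, not the model's actual kernels):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) token-pair scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_k) contextualized output

# Toy usage: n = 4 tokens, d_k = 8 dimensions per head.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```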
Fine-tuning is performed via Low-Rank Adaptation (LoRA), which reparameterizes each full-rank weight matrix as $W = W_0 + \Delta W = W_0 + BA$, where $W_0 \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Only the low-rank factors $B$ and $A$ are updated, with $W_0$ frozen. No further architecture modifications—such as alternate attention layers or prompt encoders—are applied [(Nandwana et al., 5 Dec 2025), Sec. “Experiments → Training”].
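As an illustration of the adapter parameterization, a minimal PyTorch sketch (the rank, scaling, and initialization here are illustrative choices, not the paper's reported hyperparameters):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update B @ A of rank r."""
    def __init__(self, d_out: int, d_in: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))        # trainable factor B (zero-init)
        self.scale = alpha / r                              # standard LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W0 + scale * B A; gradients flow only into A and B.
        return x @ (self.W0 + self.scale * (self.B @ self.A)).T

layer = LoRALinear(d_out=32, d_in=64)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 32])
```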
2. Instruction Fine-Tuning Methodology
Roblox Guard 1.0 is instruction fine-tuned on a large, diverse set of public and synthetic safety datasets. Key corpora include Aegis 1.0 (14.8K), Aegis 2.0 (9.4K), WildGuard (86.7K), BeaverTails (99.5K), and synthetically generated datasets (Llama Synthetic 53.8K, Mistral Synthetic 60.0K, Qwen Synthetic 60.0K), totaling 384,233 examples [Table 3].
Chain-of-Thought (CoT) rationales augment a subset of training examples: for each, DeepSeek-R1 generates multistep reasoning that follows the particular dataset’s taxonomy. Training targets follow the format “[Chain-of-Thought] → [Label] → [Category]”, with standard next-token cross-entropy applied over the combined sequence. To mitigate overfitting to a canonical sequence, input inversion randomly permutes the order of CoT, Label, and Category for each training instance (six total permutations), as sketched below. This permutation-based regularization is targeted at improving robustness to new prompt templates and previously unseen taxonomies [(Nandwana et al., 5 Dec 2025), Sec. “Datasets”].
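A minimal sketch of the input-inversion augmentation (field names, separators, and the sample record are hypothetical; the paper's exact serialization may differ):

```python
import random

FIELDS = ["cot", "label", "category"]

def build_target(example: dict, rng=random) -> str:
    """Emit the training target with CoT, Label, and Category in a random
    order (one of 3! = 6 permutations), per the input-inversion scheme."""
    order = FIELDS[:]
    rng.shuffle(order)
    parts = {
        "cot": f"[Chain-of-Thought] {example['cot']}",
        "label": f"[Label] {example['label']}",
        "category": f"[Category] {example['category']}",
    }
    return " → ".join(parts[f] for f in order)

sample = {"cot": "The prompt solicits targeted insults...",
          "label": "unsafe", "category": "Harassment"}
print(build_target(sample))
```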
3. Taxonomy-Adaptive Moderation Mechanism
Central to Roblox Guard 1.0 is its taxonomy-adaptive moderation paradigm. The model is designed to ingest an arbitrary taxonomy $\mathcal{T} = \{(c_i, \mathrm{def}_i)\}_{i=1}^{K}$—consisting of category names $c_i$ with corresponding definitions $\mathrm{def}_i$—at inference, enabling zero-shot adaptation. Roblox’s operational safety taxonomy enumerates 25 distinct categories, including but not limited to: Child Exploitation, Terrorism & Violent Extremism, Harassment, Hate Speech, and Platform Misuse [Table 1].
Formally, for each input $x$ (either the user prompt or the prompt with response), moderation is executed via:

$$f(x, \mathcal{T}) = (y_1, \dots, y_K), \qquad y_i \in \{0, 1\},$$

where each $y_i$ is a binary decision for category $c_i$. Operationally, for every $c_i \in \mathcal{T}$, the instruction takes the form:

```
Given the policy: def_i. Does x violate category c_i? Answer yes/no and optionally explain.
```
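A minimal sketch of how such per-category instructions might be assembled (the data structures and exact prompt wording are illustrative; the deployed template may differ):

```python
def moderation_prompts(x: str, taxonomy: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Build one yes/no moderation instruction per (category, definition) pair."""
    return [
        (c_i,
         f"Given the policy: {def_i} Does the following content violate "
         f"category {c_i}? Answer yes/no and optionally explain.\n\n{x}")
        for c_i, def_i in taxonomy
    ]

taxonomy = [
    ("Harassment", "Content that demeans, intimidates, or threatens an individual."),
    ("Hate Speech", "Content attacking people based on protected attributes."),
]
for category, prompt in moderation_prompts("example user text", taxonomy):
    print(category, "=>", prompt[:60], "...")
```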
4. Input–Output Moderation Pipeline
The moderation framework operates as a two-stage pipeline:
- Input-Level Moderation: Each user prompt $x$ is first moderated with respect to taxonomy $\mathcal{T}$. If any category is flagged, the prompt is blocked or refused.
- Output-Level Moderation: If the prompt passes, the main LLM (e.g., a chat model) generates response $r$. The concatenated pair $(x, r)$ is then moderated, and any violation leads to post-processing or refusal to deliver the output.
Both stages utilize Roblox Guard 1.0 and may incorporate ensemble models through majority vote or max-probability rules. Classification is thresholded (e.g., a category is flagged if its predicted probability $p_i$ exceeds a threshold $\tau$). On an AWS g6.12xlarge instance using vLLM serving, classifying a 790-token sequence incurs ≈870 ms latency, which is near real-time but may be too slow for demanding sub-second user interfaces [(Nandwana et al., 5 Dec 2025), Sec. “Experiments → Training”].
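A minimal sketch of the two-stage pipeline under these assumptions (the `guard.score` interface, `StubGuard`, `llm_generate`, and the threshold value are hypothetical stand-ins, not the production API):

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    flagged: bool
    categories: list = field(default_factory=list)

class StubGuard:
    """Toy stand-in for the guard model: flags text containing a trigger word."""
    def score(self, text, taxonomy):
        return {c: (1.0 if "attack" in text.lower() else 0.0) for c, _ in taxonomy}

def moderate(text, guard, taxonomy, tau=0.5):
    """Threshold the guard's per-category probabilities at tau."""
    probs = guard.score(text, taxonomy)
    hits = [c for c, p in probs.items() if p >= tau]
    return Verdict(flagged=bool(hits), categories=hits)

def guarded_chat(prompt, guard, llm_generate, taxonomy):
    # Stage 1: input-level moderation; refuse flagged prompts outright.
    if moderate(prompt, guard, taxonomy).flagged:
        return "Request refused by input moderation."
    # Stage 2: output-level moderation over the concatenated (prompt, response) pair.
    response = llm_generate(prompt)
    if moderate(prompt + "\n" + response, guard, taxonomy).flagged:
        return "Response withheld by output moderation."
    return response

taxonomy = [("Harassment", "..."), ("Hate Speech", "...")]
print(guarded_chat("plan an attack on my neighbor", StubGuard(),
                   lambda p: "generated reply", taxonomy))
```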
5. Benchmarking and Quantitative Outcomes
RobloxGuard-Eval is an extensible, open-source benchmark designed for rigorous evaluation of LLM moderation capabilities. It consists of 2,872 expert-red-teamed examples across 23 categories [Table 2]. Prompt-level and response-level evaluations employ various corpora (Aegis, OpenAI Mod, WildGuard, Toxic Chat, XSTest, BeaverTails, HarmBench, etc.), with F1-score as the principal metric and auxiliary tracking of per-category false positives/negatives.
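For concreteness, a minimal sketch of the per-category F1 bookkeeping such an evaluation implies (representing gold labels and predictions as multi-label category sets is an assumption; this is not the benchmark's actual harness):

```python
from collections import Counter

def per_category_f1(gold, pred, categories):
    """gold/pred: parallel lists of category sets, one per example."""
    stats = {c: Counter() for c in categories}
    for g, p in zip(gold, pred):
        for c in categories:
            if c in p and c in g:
                stats[c]["tp"] += 1  # true positive
            elif c in p:
                stats[c]["fp"] += 1  # false positive
            elif c in g:
                stats[c]["fn"] += 1  # false negative
    return {c: 2 * s["tp"] / max(2 * s["tp"] + s["fp"] + s["fn"], 1)
            for c, s in stats.items()}

gold = [{"Harassment"}, set(), {"Hate Speech"}]
pred = [{"Harassment"}, {"Harassment"}, {"Hate Speech"}]
print(per_category_f1(gold, pred, ["Harassment", "Hate Speech"]))
```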
Notable results include:
- Prompt-level F1 on Aegis 1.0: 91.9% (surpassing BingoGuard and GPT-4o).
- Response-level F1 on BeaverTails: 87.3% (GPT-4o: 83.8%).
- On RobloxGuard-Eval: 79.6% F1; other models fall below 30%.
- Out-of-domain datasets: Toxic Chat (79.1% F1), SafeRLHF (69.9%), XSTest (86.4%), HarmBench (85.7%) [(Nandwana et al., 5 Dec 2025), Table 7].
| Benchmark | Roblox Guard 1.0 F1 | Best Competing Model | F1 (Competing Model) |
|---|---|---|---|
| Aegis 1.0 (Prompt) | 91.9% | BingoGuard-8B | 90.4% |
| BeaverTails | 87.3% | GPT-4o | 83.8% |
| RobloxGuard-Eval | 79.6% | LlamaGuard3-8B | <30% |
The model demonstrates pronounced robustness and generalization, especially for categories not explicitly seen during training.
6. Empirical Analysis and Prospective Directions
Ablation studies conducted within the source paper indicate the criticality of data diversity and augmentation strategies:
- Exclusion of synthetic data reduces RobloxGuard-Eval F1 from 79.6% to 20.3%; OpenAI Mod F1 from 70.3% to 49.2%.
- Omission of CoT reduces performance by 3.9–4.4 percentage points on tasks requiring reasoning (Aegis 2.0 Response, HarmBench), while slightly improving performance on less complex data.
- Removing input inversion decreases F1 on XSTest and WildGuard Response by 2.6–3.0 percentage points [Table 8].
Identified limitations include over-refusal on highly adversarial or ambiguous prompts (noted on XSTest), occasional boundary-category confusion (e.g., Profanity vs. Harassment), and latency that may not be compatible with demanding sub-second user interfaces.
Proposed enhancements focus on hierarchical taxonomy encoding (shared category embeddings), calibration of categorical probabilities, shift to severity scoring (for graduated responses), and expansion of synthetic adversarial data (e.g., code-injection, multimodal input types).
In sum, Roblox Guard 1.0 presents a taxonomy-aware, instruction-tuned moderation system combining LoRA-powered Llama-3.1-8B-Instruct, synthetic and open datasets, CoT rationales, and input-order regularization, achieving state-of-the-art moderation performance verified by a newly introduced evaluation suite (Nandwana et al., 5 Dec 2025).