
Roblox Guard 1.0: Taxonomy-Adaptive Moderation

Updated 12 December 2025
  • Roblox Guard 1.0 is a moderation system that adapts to diverse safety taxonomies, built on the 8B-parameter Llama-3.1-8B-Instruct with LoRA fine-tuning.
  • It leverages chain-of-thought rationales and extensive synthetic and human safety datasets to enhance input-output moderation robustness.
  • Benchmarking via RobloxGuard-Eval shows state-of-the-art F1 scores for prompt and response moderation across 23 distinct safety categories.

Roblox Guard 1.0 is a taxonomy-adaptive moderation model designed to provide robust input and output guardrails for LLM systems. Developed on the Llama-3.1-8B-Instruct backbone and instruction fine-tuned to generalize across diverse, unseen safety taxonomies, it delivers comprehensive moderation by leveraging chain-of-thought (CoT) rationales, extensive human and synthetic safety corpora, and input inversion for robust and adaptive response behaviors. Roblox Guard 1.0 introduces RobloxGuard-Eval, a benchmark curated for extensible, fine-grained evaluation of LLM moderation capabilities (Nandwana et al., 5 Dec 2025).

1. Model Architecture and Adaptation

The foundation of Roblox Guard 1.0 is the Llama-3.1-8B-Instruct Transformer. This 8-billion-parameter architecture employs standard multi-head self-attention and feed-forward networks. Formally, attention at each layer \ell follows:

A^\ell = \mathrm{softmax}(Q^\ell K^{\ell\top}/\sqrt{d_k})\, V^\ell

where Q^\ell, K^\ell, V^\ell \in \mathbb{R}^{T \times d_k} are the projected query, key, and value matrices, T is the sequence length, and d_k is the dimensionality per head. The feed-forward sublayer is structured as:

\mathrm{FFN}(x) = W_2\, \mathrm{GeLU}(W_1 x + b_1) + b_2
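The two formulas above can be sketched numerically. The following is a minimal single-head illustration in numpy (the tanh-based GeLU approximation and the toy dimensions are assumptions for the sketch, not the paper's implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """One head of A = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) scaled dot-product scores
    return softmax(scores, axis=-1) @ V  # (T, d_k) attention output

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 GeLU(W1 x + b1) + b2, using the common tanh GeLU approximation."""
    h = W1 @ x + b1
    gelu = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return W2 @ gelu + b2

# Toy shapes: sequence length T = 4, per-head dimension d_k = 8.
T, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```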

Fine-tuning is performed via Low-Rank Adaptation (LoRA), which reparameterizes each full-rank weight matrix W \in \mathbb{R}^{d \times d} as W' = W_0 + \Delta W, where \Delta W = AB, A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times d}, and r = 16. Only the low-rank factors A, B are updated, with W_0 frozen. No further architecture modifications (such as alternate attention layers or prompt encoders) are applied [(Nandwana et al., 5 Dec 2025), Sec. “Experiments → Training”].
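The LoRA reparameterization can be sketched as follows; the hidden size d = 64 is illustrative (the real model's is much larger), while r = 16 matches the rank reported in the paper:

```python
import numpy as np

d, r = 64, 16  # d is illustrative; r = 16 is the LoRA rank used in the paper
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d, r))  # trainable low-rank factor
B = np.zeros((r, d))                     # zero-initialized, so W' = W0 at start

W_eff = W0 + A @ B  # effective weight W' = W0 + ΔW, with ΔW = AB

# Only A and B receive gradient updates; the trainable parameter count
# per matrix drops from d*d to 2*d*r.
print(d * d, 2 * d * r)  # 4096 2048
```

Because B starts at zero, the adapted model is initially identical to the frozen backbone, and training only perturbs it through the low-rank product.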

2. Instruction Fine-Tuning Methodology

Roblox Guard 1.0 is instruction fine-tuned on a large, diverse set of public and synthetic safety datasets. Key corpora include Aegis 1.0 (14.8K), Aegis 2.0 (9.4K), WildGuard (86.7K), BeaverTails (99.5K), and synthetically generated datasets (Llama Synthetic 53.8K, Mistral Synthetic 60.0K, Qwen Synthetic 60.0K), totaling 384,233 examples [Table 3].

Chain-of-Thought (CoT) rationales augment a subset of training examples: each uses DeepSeek-R1 to generate multistep reasoning following the particular dataset’s taxonomy. Training targets follow the format “[Chain-of-Thought] → [Label] → [Category]”, with standard next-token cross-entropy applied over the combined sequence. To mitigate overfitting to a canonical sequence, input inversion randomly permutes the order of CoT, Label, and Category for each training instance (six total permutations). This permutation-based regularization is targeted at enhancing robustness with respect to templatic and previously unseen taxonomies [(Nandwana et al., 5 Dec 2025), Sec. “Datasets”].
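The input-inversion scheme described above can be sketched with stdlib tools. The field names and serialization format below are hypothetical stand-ins; the paper specifies only that the CoT, Label, and Category components are randomly permuted (3! = 6 orders):

```python
import random
from itertools import permutations

def build_target(cot: str, label: str, category: str, rng: random.Random) -> str:
    """Serialize one training target with a randomly permuted field order."""
    fields = [("Chain-of-Thought", cot), ("Label", label), ("Category", category)]
    order = rng.choice(list(permutations(fields)))  # one of the 6 permutations
    return " -> ".join(f"[{name}: {value}]" for name, value in order)

rng = random.Random(0)
target = build_target("step-by-step rationale...", "unsafe", "Harassment", rng)
print(target)
```

Sampling a fresh permutation per training instance prevents the model from memorizing one canonical output template, which is the regularization effect the paper attributes to input inversion.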

3. Taxonomy-Adaptive Moderation Mechanism

Central to Roblox Guard 1.0 is its taxonomy-adaptive moderation paradigm. The model is designed to ingest an arbitrary taxonomy T = \{c_1, \dots, c_n\} (category names with corresponding definitions) at inference time, enabling zero-shot adaptation. Roblox’s operational safety taxonomy enumerates 25 distinct categories, including but not limited to: Child Exploitation, Terrorism & Violent Extremism, Harassment, Hate Speech, and Platform Misuse [Table 1].

Formally, for each input x (either the user prompt alone or the prompt with its response), moderation is executed via:

\text{Moderation}(x; T) = \{y_i\}_{i=1}^{n}

where each y_i is a binary decision for category c_i. Operationally, for every c_i, the instruction

Given the policy: def_i. Does x violate category c_i? Answer yes/no and optionally explain.

is submitted, and the model’s output y_i is aggregated. No retraining is necessary when the taxonomy changes; only the instruction set is updated. This compositionality underpins the generalization to taxonomies unseen during training (Nandwana et al., 5 Dec 2025).
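The per-category loop described above can be sketched as follows. `query_model` is a hypothetical stand-in for a call to the fine-tuned model, and the prompt wording follows the template quoted above:

```python
def query_model(instruction: str) -> str:
    """Placeholder: a real deployment would call the LoRA-tuned model here."""
    return "no"

def moderate(x: str, taxonomy: dict) -> dict:
    """Return {category: violates?} for input x under an arbitrary taxonomy T."""
    decisions = {}
    for category, definition in taxonomy.items():
        instruction = (
            f"Given the policy: {definition}. "
            f"Does the following input violate category {category}? "
            f"Answer yes/no and optionally explain.\n\nInput: {x}"
        )
        answer = query_model(instruction)
        decisions[category] = answer.strip().lower().startswith("yes")
    return decisions

# Illustrative two-category taxonomy (definitions invented for the sketch).
taxonomy = {
    "Harassment": "Targeted abuse or intimidation of a person",
    "Platform Misuse": "Abuse of platform features or policies",
}
print(moderate("hello world", taxonomy))
```

Because the taxonomy is passed in as data rather than baked into the weights, swapping categories requires only editing this dictionary, which mirrors the zero-shot adaptation claim.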

4. Input–Output Moderation Pipeline

The moderation framework operates as a two-stage pipeline:

  1. Input-Level Moderation: Each user prompt P is first moderated with respect to taxonomy T. If any category is flagged, the prompt is blocked or refused.
  2. Output-Level Moderation: If the prompt passes, the main LLM (e.g., a chat model) generates response R. The concatenated pair (P \Vert R) is then moderated, and any violation leads to post-processing or refusal to deliver the output.
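The two-stage flow can be sketched with placeholders: `moderate` stands in for a Roblox Guard 1.0 call and `chat_model` for the main LLM (both are assumptions for illustration, keyed here on a dummy trigger string):

```python
REFUSAL = "Sorry, I can't help with that."

def moderate(text: str) -> bool:
    """Placeholder guard: True means some taxonomy category was flagged."""
    return "FORBIDDEN" in text

def chat_model(prompt: str) -> str:
    """Placeholder for the main LLM."""
    return f"Echo: {prompt}"

def guarded_chat(prompt: str) -> str:
    if moderate(prompt):                    # stage 1: input-level moderation of P
        return REFUSAL
    response = chat_model(prompt)
    if moderate(prompt + " " + response):   # stage 2: moderate the pair P || R
        return REFUSAL
    return response

print(guarded_chat("hello"))           # passes both stages
print(guarded_chat("FORBIDDEN plan"))  # blocked at the input stage
```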

Both stages utilize Roblox Guard 1.0 and may incorporate ensemble models through majority-vote or max-probability rules. Classification is thresholded: a violation is flagged if P(\text{violation}) \geq \tau, with \tau = 0.5. On an AWS g6.12xlarge instance with vLLM serving, classifying a 790-token sequence takes ≈870 ms, which is under one second but high for latency-critical interfaces [(Nandwana et al., 5 Dec 2025), Sec. “Experiments → Training”].
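The ensemble rules and the \tau = 0.5 threshold can be sketched as follows; the per-member probabilities are illustrative values, not the model's real outputs:

```python
TAU = 0.5  # decision threshold from the paper

def majority_vote(probs: list, tau: float = TAU) -> bool:
    """Flag a violation if more than half the ensemble members exceed tau."""
    votes = [p >= tau for p in probs]
    return sum(votes) > len(votes) / 2

def max_probability(probs: list, tau: float = TAU) -> bool:
    """Flag a violation if any single member's probability exceeds tau."""
    return max(probs) >= tau

member_probs = [0.62, 0.41, 0.55]  # illustrative per-model P(violation)
print(majority_vote(member_probs))    # True: 2 of 3 members meet tau
print(max_probability(member_probs))  # True: 0.62 >= 0.5
```

The max-probability rule is the more conservative of the two: a single confident member suffices to block, whereas majority vote requires agreement.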

5. Benchmarking and Quantitative Outcomes

RobloxGuard-Eval is an extensible, open-source benchmark designed for rigorous evaluation of LLM moderation capabilities. It consists of 2,872 expert-red-teamed examples across 23 categories [Table 2]. Prompt-level and response-level evaluations employ various corpora (Aegis, OpenAI Mod, WildGuard, Toxic Chat, XSTest, BeaverTails, HarmBench, etc.), with F1-score as the principal metric and auxiliary tracking of per-category false positives/negatives.

Notable results include:

  • Prompt-level F1 on Aegis 1.0: 91.9% (surpassing BingoGuard and GPT-4o).
  • Response-level F1 on BeaverTails: 87.3% (GPT-4o: 83.8%).
  • On RobloxGuard-Eval: 79.6% F1; other models fall below 30%.
  • Out-of-domain datasets: Toxic Chat (79.1% F1), SafeRLHF (69.9%), XSTest (86.4%), HarmBench (85.7%) [(Nandwana et al., 5 Dec 2025), Table 7].
| Benchmark          | Roblox Guard 1.0 F1 | Best Competing Model F1 |
|--------------------|---------------------|-------------------------|
| Aegis 1.0 (Prompt) | 91.9%               | 90.4% (BingoGuard-8B)   |
| BeaverTails        | 87.3%               | 83.8% (GPT-4o)          |
| RobloxGuard-Eval   | 79.6%               | <30% (LlamaGuard3-8B)   |

The model demonstrates pronounced robustness and generalization, especially for categories not explicitly seen during training.

6. Empirical Analysis and Prospective Directions

Ablation studies conducted within the source paper indicate the criticality of data diversity and augmentation strategies:

  • Exclusion of synthetic data reduces RobloxGuard-Eval F1 from 79.6% to 20.3%; OpenAI Mod F1 from 70.3% to 49.2%.
  • Omission of CoT reduces performance by 3.9–4.4 percentage points on tasks requiring reasoning (Aegis 2.0 Response, HarmBench), while slightly improving performance on less complex data.
  • Removing input inversion decreases F1 on XSTest and WildGuard Response by 2.6–3.0 percentage points [Table 8].

Identified limitations include over-refusal on highly adversarial or ambiguous prompts (noted on XSTest), occasional boundary-category confusion (e.g., Profanity vs. Harassment), and latency that may not be compatible with demanding sub-second user interfaces.

Proposed enhancements focus on hierarchical taxonomy encoding (shared category embeddings), calibration of categorical probabilities, shift to severity scoring (for graduated responses), and expansion of synthetic adversarial data (e.g., code-injection, multimodal input types).

In sum, Roblox Guard 1.0 presents a taxonomy-aware, instruction-tuned moderation system combining LoRA-powered Llama-3.1-8B-Instruct, synthetic and open datasets, CoT rationales, and input-order regularization, achieving state-of-the-art moderation performance verified by a newly introduced evaluation suite (Nandwana et al., 5 Dec 2025).
