RobloxGuard-Eval Benchmark

Updated 12 December 2025

RobloxGuard-Eval is a benchmark that assesses LLM safety guardrails against a production content-safety taxonomy.
Its annotation scheme covers 25 top-level categories, including risks that prior benchmarks underrepresent, such as off-platform solicitations.
The benchmark supports empirical analysis of moderation frameworks through a dedicated metric suite.

RobloxGuard-Eval is a taxonomy-rich benchmark developed to facilitate the end-to-end safety evaluation of LLM guardrails and moderation frameworks. Introduced alongside Roblox Guard 1.0, it provides an extensible platform rooted in a production content-safety taxonomy for systematically assessing the effectiveness of input-output moderation methods in LLM-based systems. RobloxGuard-Eval anchors its evaluations on a comprehensive annotation scheme and robust metric suite, supporting empirical analysis across a broad array of real-world and emerging harm categories (Nandwana et al., 5 Dec 2025).

1. Safety Taxonomy

RobloxGuard-Eval leverages Roblox’s production content-safety taxonomy as its organizational backbone. This taxonomy features 25 distinct top-level categories, explicitly designed to span a representative diversity of harms encountered in online environments. The categories include domains that are historically underrepresented in prior benchmarks, such as off-platform solicitations and deceptive monetization. No further public breakdown into subcategories is disclosed in the original publication.

Category Example	Category Example	Category Example
Child Exploitation	Intellectual Property Violations	Cheating and Scams
Threats, Bullying, and Harassment	Prohibited Advertising Practices	Soliciting Donations: Tipping
Discrimination, Slurs, and Hate Speech	Sharing Personal Information	Misusing Roblox Systems: Jailbreaking
Real-World Sensitive Events	Terrorism and Violent Extremism	Suicide, Self-Injury, and Harmful Behavior
Romantic and Sexual Content	Violent
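
A taxonomy-anchored evaluation of this kind can be sketched as per-category scoring of a guardrail's decisions. The example below is a minimal illustration, not the benchmark's actual implementation: the source does not name the metrics in its suite, so precision, recall, and F1 are assumptions here, and the `Example` record, its field names, and `per_category_scores` are hypothetical. The category names are taken from the table above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One annotated benchmark item (hypothetical schema)."""
    category: str    # taxonomy category the item is filed under
    is_unsafe: bool  # gold annotation
    flagged: bool    # guardrail decision being evaluated


def per_category_scores(examples):
    """Return precision, recall, and F1 for each category (assumed metrics)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ex in examples:
        c = counts[ex.category]  # registers the category even for true negatives
        if ex.flagged and ex.is_unsafe:
            c["tp"] += 1
        elif ex.flagged and not ex.is_unsafe:
            c["fp"] += 1
        elif not ex.flagged and ex.is_unsafe:
            c["fn"] += 1

    scores = {}
    for category, c in counts.items():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[category] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy data for illustration only; not drawn from the benchmark.
examples = [
    Example("Cheating and Scams", is_unsafe=True, flagged=True),
    Example("Cheating and Scams", is_unsafe=True, flagged=False),
    Example("Cheating and Scams", is_unsafe=False, flagged=True),
    Example("Sharing Personal Information", is_unsafe=True, flagged=True),
]
scores = per_category_scores(examples)
```

Reporting scores per category rather than as a single aggregate makes it visible when a guardrail performs well on common harms but poorly on less-represented ones.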

References (1)

Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models (2025)
