RobloxGuard-Eval Benchmark
- RobloxGuard-Eval is a taxonomy-rich benchmark that systematically assesses LLM safety guardrails against a production content-safety taxonomy.
- Its annotation framework covers 25 top-level categories, including risks that earlier benchmarks underrepresent, such as off-platform solicitations.
- The benchmark pairs empirical analysis with a detailed metric suite to yield actionable insights for improving moderation frameworks.
RobloxGuard-Eval is a taxonomy-rich benchmark for end-to-end safety evaluation of LLM guardrails and moderation frameworks. Introduced alongside Roblox Guard 1.0, it provides an extensible platform, rooted in a production content-safety taxonomy, for systematically assessing input-output moderation methods in LLM-based systems. RobloxGuard-Eval grounds its evaluations in a comprehensive annotation scheme and a metric suite, supporting empirical analysis across a broad range of real-world and emerging harm categories (Nandwana et al., 5 Dec 2025).
1. Safety Taxonomy
RobloxGuard-Eval uses Roblox’s production content-safety taxonomy as its organizational backbone. The taxonomy has 25 top-level categories, designed to span a representative range of harms encountered in online environments. It includes categories that prior benchmarks have historically underrepresented, such as off-platform solicitations and deceptive monetization. The original publication does not disclose any further breakdown into subcategories.
| Example Category | Example Category | Example Category |
|---|---|---|
| Child Exploitation | Intellectual Property Violations | Cheating and Scams |
| Threats, Bullying, and Harassment | Prohibited Advertising Practices | Soliciting Donations: Tipping |
| Discrimination, Slurs, and Hate Speech | Sharing Personal Information | Misusing Roblox Systems: Jailbreaking |
| Real-World Sensitive Events | Terrorism and Violent Extremism | Suicide, Self-Injury, and Harmful Behavior |
| Romantic and Sexual Content | Violent |
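
As an illustration of how annotations might be tied to this taxonomy, the following is a minimal sketch, not the benchmark's actual tooling. It encodes only the 13 category names that appear in full in the table above (the truncated final entry is left out rather than guessed), and the names `HarmCategory`, `parse_label`, and `category_counts` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter
from enum import Enum


class HarmCategory(Enum):
    """Example top-level categories from the table above (13 of the 25).

    Assumption: the enum and its member names are illustrative only;
    the publication does not specify a data format for the taxonomy.
    """
    CHILD_EXPLOITATION = "Child Exploitation"
    INTELLECTUAL_PROPERTY_VIOLATIONS = "Intellectual Property Violations"
    CHEATING_AND_SCAMS = "Cheating and Scams"
    THREATS_BULLYING_HARASSMENT = "Threats, Bullying, and Harassment"
    PROHIBITED_ADVERTISING_PRACTICES = "Prohibited Advertising Practices"
    SOLICITING_DONATIONS_TIPPING = "Soliciting Donations: Tipping"
    DISCRIMINATION_SLURS_HATE_SPEECH = "Discrimination, Slurs, and Hate Speech"
    SHARING_PERSONAL_INFORMATION = "Sharing Personal Information"
    MISUSING_ROBLOX_SYSTEMS_JAILBREAKING = "Misusing Roblox Systems: Jailbreaking"
    REAL_WORLD_SENSITIVE_EVENTS = "Real-World Sensitive Events"
    TERRORISM_VIOLENT_EXTREMISM = "Terrorism and Violent Extremism"
    SUICIDE_SELF_INJURY_HARMFUL_BEHAVIOR = "Suicide, Self-Injury, and Harmful Behavior"
    ROMANTIC_AND_SEXUAL_CONTENT = "Romantic and Sexual Content"


def parse_label(raw: str) -> HarmCategory:
    """Map a label string to a category, ignoring case and outer whitespace."""
    normalized = raw.strip().casefold()
    for category in HarmCategory:
        if category.value.casefold() == normalized:
            return category
    raise ValueError(f"unknown category label: {raw!r}")


def category_counts(labels):
    """Count how many labels fall into each category."""
    return Counter(parse_label(label) for label in labels)
```

Rejecting unknown labels with an error, rather than silently bucketing them, keeps annotation typos from distorting per-category tallies.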