RabakBench: Localized Multilingual Safety Benchmark
- RabakBench is a localized safety benchmark offering a dataset of over 5,000 examples with a fine-grained harm taxonomy tailored for low-resource languages in Singapore.
- It employs a three-stage pipeline—Generate, Label, and Translate—to curate, annotate, and accurately translate safety-critical data from Singlish to Chinese, Malay, and Tamil.
- The benchmark exposes significant performance gaps in LLM safety classifiers across these languages, underscoring the need for culturally adaptive moderation solutions.
RabakBench is a localized, multilingual safety benchmark targeting low-resource languages and dialects in Singapore, specifically Singlish, Chinese, Malay, and Tamil. Designed to address the poor performance of existing LLM safety classifiers in multilingual and culturally contextual environments, RabakBench introduces a scalable framework for generating, annotating, and translating safety-critical data. The benchmark comprises over 5,000 safety-labeled examples and employs a fine-grained multi-label taxonomy, enabling robust evaluation of safety guardrail models in Southeast Asian contexts (2507.05980).
1. Motivation and Linguistic Context
Standard safety benchmarks for LLMs have predominantly focused on English, overlooking the linguistic diversity and unique cultural harms present in multilingual societies. Singapore, with its blend of Singlish—a creole rooted in English and influenced by Chinese, Malay, and Tamil—alongside these major languages, poses significant challenges to LLM safety classifiers. LLMs trained and benchmarked primarily on English data frequently misclassify nuanced, code-mixed, or vernacular harms. RabakBench directly responds to these gaps, constructing a dataset that reflects local linguistic realities and provides precise, nuanced harm categorization suitable for low-resource scenarios.
2. Three-Stage Pipeline Methodology
The construction of RabakBench follows a three-stage pipeline: Generate, Label, and Translate.
Stage 1: Generate
- Organic Singlish web comments are curated and converted into instruction-style prompts using template-based methods.
- Adversarial red teaming involves two roles: an “Attack LLM” synthesizes prompts crafted to evade existing moderation systems (such as LionGuard and other commercial APIs), while a “Critic LLM” verifies that each candidate is genuinely harmful yet misclassified by those systems, ensuring the retained examples are challenging for current models; a schematic sketch of this loop follows.
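A minimal sketch of this attack–critic loop is given below. The function names `call_attack_llm`, `call_critic_llm`, and `moderation_api` are hypothetical stand-ins, since the paper does not publish its exact prompts or API calls; the retention criterion (harmful per the critic, unflagged by the guardrail) follows the description above.

```python
# Hypothetical sketch of the Generate stage's attack/critic loop.
# All three helpers below are stand-ins to be wired up to real endpoints.

def call_attack_llm(seed: str) -> str:
    """Rewrite a seed comment into a prompt crafted to evade moderation."""
    raise NotImplementedError  # plug in an LLM endpoint of choice

def moderation_api(text: str) -> bool:
    """Return True if the moderation system flags `text` as harmful."""
    raise NotImplementedError  # e.g., LionGuard or a commercial API

def call_critic_llm(text: str) -> bool:
    """Return True if the critic judges `text` to be genuinely harmful."""
    raise NotImplementedError

def red_team(seeds, max_rounds=3):
    """Collect candidates the critic deems harmful but the guardrail misses."""
    accepted = []
    for seed in seeds:
        candidate = seed
        for _ in range(max_rounds):
            candidate = call_attack_llm(candidate)
            evaded = not moderation_api(candidate)
            if evaded and call_critic_llm(candidate):
                accepted.append(candidate)  # hard negative: harmful yet unflagged
                break
    return accepted
```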
Stage 2: Label
- To enable large-scale annotation while managing resource constraints, a weak supervision approach is adopted. The Alt-Test methodology quantifies agreement between proposed model annotations and human judgments.
- The per-annotator advantage of a model $m$ over a held-out human annotator $h$ is calculated as

$$\rho_{m,h} = \frac{1}{|X_h|} \sum_{x \in X_h} \mathbb{1}\left[\, s_m(x) \ge s_h(x) \,\right],$$

where $s_m(x)$ denotes the agreement of model $m$'s label on instance $x$ with the remaining human annotators and $s_h(x)$ denotes annotator $h$'s agreement with that same set, and the overall Average Advantage Probability (AAP) is

$$\mathrm{AAP}_m = \frac{1}{|H|} \sum_{h \in H} \rho_{m,h}.$$
- A cohort of six LLMs is screened for annotation alignment with humans; Gemini 2.0 Flash, o3-mini-low, and Claude 3.5 Haiku are selected based on demonstrated agreement. These models assign binary labels (yes/no) per harm type, and majority voting determines the final annotation (sketched below).
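Under these definitions, the labeling logic can be sketched as follows. The function names and the simple agreement scoring are illustrative, assuming binary labels per harm type as described above; they track the reconstructed formulas rather than the paper's exact implementation.

```python
def majority_vote(model_labels):
    """Final binary label per harm type from the three selected LLMs."""
    # model_labels: list of dicts like {"hateful": 1, "sexual": 0, ...}
    harms = model_labels[0].keys()
    return {h: int(sum(m[h] for m in model_labels) >= 2) for h in harms}

def advantage_probability(model_preds, annotator_labels, others_labels):
    """rho_{m,h}: fraction of instances where the model agrees with the
    remaining annotators at least as well as the held-out annotator does."""
    wins = 0
    for pred, gold, rest in zip(model_preds, annotator_labels, others_labels):
        s_model = sum(pred == r for r in rest) / len(rest)
        s_human = sum(gold == r for r in rest) / len(rest)
        wins += s_model >= s_human
    return wins / len(model_preds)

def average_advantage_probability(per_annotator_rhos):
    """AAP_m: mean advantage probability across held-out annotators."""
    return sum(per_annotator_rhos) / len(per_annotator_rhos)
```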
Stage 3: Translate
- The annotated Singlish examples are extended into Chinese, Malay, and Tamil through custom LLM prompts and carefully selected few-shot examples.
- Translation fidelity is crucial: translator models such as GPT-4o mini are selected using cosine-similarity and back-translation metrics, and outputs are validated by human experts to ensure that both semantic content and harm severity are preserved, actively countering the common “sanitization” effect in which translation softens toxic language. A schematic fidelity check follows.
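A plausible rendering of such a back-translation check, assuming a multilingual embedding model and an LLM translation call: `embed` and `translate` are stubs, and the 0.85 threshold is an illustrative choice, not a value reported in the paper.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Sentence embedding; stubbed here -- plug in any multilingual encoder."""
    raise NotImplementedError

def translate(text: str, target_lang: str) -> str:
    """LLM translation call (the paper uses models such as GPT-4o mini)."""
    raise NotImplementedError

def fidelity_check(source: str, target_lang: str, threshold: float = 0.85):
    """Back-translate and compare embeddings; low-similarity pairs are
    flagged for human review rather than accepted automatically."""
    translated = translate(source, target_lang)
    back = translate(translated, "Singlish")
    u, v = embed(source), embed(back)
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return translated, cos, cos >= threshold
```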
3. Dataset Composition and Annotation Scheme
RabakBench consists of 5,364 safety-labeled entries, partitioned equally across Singlish, Chinese, Malay, and Tamil (1,341 examples each). Each entry is annotated for six fine-grained harm types:
- Hateful content (with two severity levels)
- Sexual content
- Insults
- Physical violence
- Self-harm
- Other misconduct
Labels are derived via consensus among LLM annotators, with human-verified translations ensuring linguistic nuance and toxicity are retained. This comprehensive annotation strategy yields a dataset suitable for rigorous comparative evaluations across languages and dialects, and provides a valuable resource for safety-critical multilingual LLM evaluation.
| Language | Number of Examples | Key Features |
|---|---|---|
| Singlish | 1,341 | Code-mixed, locally nuanced |
| Chinese | 1,341 | Human-verified translation |
| Malay | 1,341 | Human-verified translation |
| Tamil | 1,341 | Human-verified translation |
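For concreteness, a single entry might be represented as below. The field names are hypothetical, chosen only to mirror the taxonomy above, and do not claim to match the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class RabakBenchEntry:
    """Illustrative record layout; actual field names may differ."""
    text: str                    # Singlish source or translated text
    language: str                # "Singlish" | "Chinese" | "Malay" | "Tamil"
    labels: dict = field(default_factory=dict)  # harm type -> 0/1
    hateful_severity: int = 0    # 0 = none; 1/2 = the two severity levels

example = RabakBenchEntry(
    text="<redacted harmful comment>",
    language="Singlish",
    labels={"hateful": 1, "sexual": 0, "insults": 1,
            "physical_violence": 0, "self_harm": 0, "other_misconduct": 0},
    hateful_severity=1,
)
```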
4. Guardrail Classifier Evaluation and Observed Results
Eleven open-source and commercial content moderation systems are systematically evaluated on RabakBench. Results indicate that most classifiers suffer significant performance drops when confronted with the benchmark's non-English or code-mixed test cases. For example, AWS Bedrock Guardrail achieved an F1 score of 66.50% on Singlish but dropped below 20% on the Chinese, Malay, and Tamil variants. These findings demonstrate that LLM safety classifiers, typically optimized on English benchmarks, fail to generalize harm detection to localized and multilingual contexts, revealing substantial limitations for deployment in real-world, culturally diverse environments.
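A straightforward way to reproduce this kind of per-language comparison, assuming binary harmful/not-harmful classifier outputs and scikit-learn's `f1_score`; the record layout mirrors the hypothetical schema sketched earlier.

```python
from collections import defaultdict
from sklearn.metrics import f1_score

def per_language_f1(records, predictions):
    """Binary F1 per language; `records` are dataset entries (dicts with
    "language" and "labels" keys) and `predictions` the classifier outputs."""
    by_lang = defaultdict(lambda: ([], []))
    for rec, pred in zip(records, predictions):
        y_true, y_pred = by_lang[rec["language"]]
        y_true.append(int(any(rec["labels"].values())))  # any harm type -> 1
        y_pred.append(int(pred))
    return {lang: f1_score(t, p) for lang, (t, p) in by_lang.items()}
```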
5. Reproducibility and Framework Adaptability
RabakBench’s construction pipeline—comprising adversarial data generation, scalable consensus-driven labeling, and high-fidelity translation—offers a replicable paradigm for other low-resource or culturally specific settings. Its methodology allows organizations and research groups operating in diverse linguistic environments to adapt the pipeline and construct localized benchmarks, facilitating improved training and evaluation of safety classifiers where annotated data is scarce.
A plausible implication is that this approach can catalyze the development of new benchmarks and evaluation strategies in regions with significant linguistic diversity and limited annotator pools. The use of LLMs in both red-teaming and annotation, while not entirely supplanting human oversight, can significantly lower the barrier to creating high-quality safety evaluation corpora in under-resourced contexts.
6. Implications, Limitations, and Future Directions
RabakBench highlights deficiencies in current guardrail classifiers and the need for more robust multilingual safety frameworks. While the dataset is tailored to Singapore’s linguistic ecosystem, the underlying methodology is expressly designed for broader applicability, with potential to expand geographically and linguistically, and to dynamically incorporate new harm taxonomies to address evolving societal norms and risks.
Future directions emphasized include:
- Extension to additional languages and dialects, along with adaptation of the harm schema to new sociocultural, regulatory, or institutional contexts.
- Further automation and refinement in adversarial generation and translation, improving coverage of slang, code-mixing, and rapidly evolving expressions of harm.
- Reduced reliance on LLMs for labeling, incorporating more human-in-the-loop annotation or alternative weak supervision strategies to increase annotation fidelity.
This suggests that as LLMs and content moderation technologies mature, benchmarks like RabakBench will be instrumental in both diagnosing model failures and steering the development of culturally and linguistically adaptive safety solutions.