
FanarGuard: Bilingual Cultural Moderation

Updated 1 December 2025
  • FanarGuard is a bilingual content moderation filter that evaluates both general safety and cultural alignment for Arabic and English contexts.
  • It employs a multidimensional scoring system with independent ratings for safety (using mean scores) and cultural fit (using minimum scores) to enable nuanced filtering.
  • The system demonstrates parameter efficiency and robust deployment potential by addressing cultural biases through integration of regional norms into LLM moderation.

FanarGuard is a bilingual content moderation filter for LLMs, specifically engineered to jointly assess both general safety and cultural alignment in Arabic and English contexts. Unlike prevailing moderation systems that prioritize universal harmlessness, FanarGuard directly accounts for regional norms and sensitivities—such as religious, dietary, gender, and social constraints—that are commonly overlooked by Western-centric moderation filters. The system utilizes a multidimensional scoring regime, providing developers with independent safety and culture ratings, thus enabling nuanced filtering decisions and integration into culturally diverse applications (Fatehkia et al., 24 Nov 2025).

1. Motivation and Problem Space

FanarGuard addresses two primary limitations in existing moderation pipelines:

  • Standard alignment failure: Established filters focus on the safety principles of helpfulness, harmlessness, and honesty. However, these general standards inadequately capture region-specific and context-sensitive cultural violations, which may not be intrinsically harmful but contravene prevailing cultural expectations in Arabic-speaking societies.
  • Cultural misalignment in LLMs: Pretrained LLMs exhibit Western-centric biases, leading to recommendations or justifications (such as consuming frog meat or supporting unchaperoned female travel) that are technically safe under a universal safety rubric but violate Arabic cultural or religious precepts.

FanarGuard differs from systems like OpenAI’s moderation API, WildGuard, ShieldGemma, PolyGuard, and X-Guard by evaluating both axes—general safety and cultural fit—thereby offering a two-dimensional scoring outcome rather than a coarse binary “allow/block” signal (Fatehkia et al., 24 Nov 2025).

2. Dataset Construction and Annotation Protocol

The FanarGuard training corpus consists of 468,000 balanced prompt–response pairs, distributed across English and Arabic, and sampled to ensure dense coverage of both safety and cultural-misalignment cases:

  • Sources:
    • Public safety-oriented datasets (safety-instruction, safety-preference, safety-filter, and general capability prompts).
    • Culturally targeted data generated both synthetically (topic-driven) and via human filtering for norm salience.
  • Annotation:
    • Each instance is scored by four independent LLM judges (Qwen2.5-72B, Qwen3-32B, Gemma-2-27B-it, C4AI Command-R-Plus) on two five-point scales: harmlessness (safety) and sociocultural alignment.
    • The safety score $s_i^{\mathrm{safety}}$ is the mean across judges, while the cultural score $s_i^{\mathrm{culture}}$ is the minimum value, conservatively flagging any perceived misalignment.
    • High inter-annotator reliability is observed (ICC(3,k): 0.92 safety, 0.85 culture).
  • Balancing and Splits:
    • To achieve uniform representation, samples are stratified into 0.5-point bins and subsampled, mitigating overrepresentation of benign cases.
    • The set is partitioned 80%/5%/15% for train, validation, and test.
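The aggregation rule above (mean across judges for safety, minimum for culture) can be sketched directly; the judge ratings below are illustrative, not taken from the paper:

```python
def aggregate_scores(judge_scores):
    """Combine per-judge 5-point ratings into the two target labels.

    judge_scores: list of (safety, culture) tuples, one per LLM judge.
    Safety uses the mean, smoothing out judge noise; culture uses the
    minimum, so a single judge flagging misalignment lowers the label.
    """
    safety = sum(s for s, _ in judge_scores) / len(judge_scores)
    culture = min(c for _, c in judge_scores)
    return safety, culture

# Illustrative ratings from four judges (values are made up):
judges = [(5.0, 4.0), (4.5, 4.5), (5.0, 2.0), (4.5, 4.0)]
print(aggregate_scores(judges))  # → (4.75, 2.0)
```

Note how the single dissenting culture rating (2.0) dominates the cultural label while barely moving the safety mean; this asymmetry is what makes the culture axis conservative.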

3. Model Architecture and Training Methodology

FanarGuard models are structured as two-output regressors, taking a [PROMPT|RESPONSE] concatenation as input and returning continuous-valued safety and cultural alignment predictions:

  • Model variants:
    • FanarGuard-R: RoBERTa-large encoder, 435M parameters, bilingual.
    • FanarGuard-G-2B: Gemma-2-2B-it decoder, 2.61B parameters.
    • FanarGuard-G-4B: Gemma-3-4B-it decoder, 4.3B parameters.
  • Loss Function:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\hat{s}_i^{\mathrm{safety}} - s_i^{\mathrm{safety}}\right)^2 + \left(\hat{s}_i^{\mathrm{culture}} - s_i^{\mathrm{culture}}\right)^2\right]$$

  • Training:
    • Hyperparameters: learning rate (1e-5 for R, 1e-6 for G variants), batch size 32, epochs (R: 5, 2B: 3, 4B: 2), AdamW optimizer.
    • Hardware: NVIDIA H100/H200.
    • Training times: 15–71 hours depending on model size.

This configuration enables scalable, parameter-efficient models that maintain competitive alignment while supporting robust bilingual moderation.
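The joint regression objective can be written out in a few lines; this is a minimal plain-Python sketch of the loss, with variable names of our own choosing:

```python
def fanarguard_loss(preds, targets):
    """Mean over examples of the summed squared error on both heads.

    preds, targets: lists of (safety, culture) score pairs.
    Mirrors L = (1/N) * sum_i [(ŝ_safety − s_safety)² + (ŝ_culture − s_culture)²].
    """
    n = len(preds)
    return sum(
        (ps - ts) ** 2 + (pc - tc) ** 2
        for (ps, pc), (ts, tc) in zip(preds, targets)
    ) / n

# A half-point error on one axis of one example contributes 0.25/N:
print(fanarguard_loss([(5.0, 4.0)], [(4.5, 4.0)]))  # → 0.25
```

Because the two squared-error terms are simply summed, the safety and culture heads are trained jointly but contribute independent gradients, matching the independent-axes design of the scoring scheme.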

4. Evaluation Benchmarks, Metrics, and Comparative Results

FanarGuard’s evaluation protocol incorporates both standard safety datasets and bespoke Arabic cultural-alignment assessments:

  • Arabic Culture Benchmark:
    • 1,008 “norm-sensitive” prompts, categorized into eight sociocultural themes (family, gender, autonomy, governance, minority identity, sexuality, geopolitics, blasphemy).
    • Human raters fluent in both Arabic and English assign 5-point ratings to 1,448 prompt–response instances generated by five LLMs.
    • Reliability: inter-annotator ICC = 0.64.
  • General Safety Benchmarks:
    • Five established datasets: BeaverTails, HarmBench, SafeRLHF, WildGuardMix, XSTest.
    • Binary labels derived by thresholding the FanarGuard regression output to optimize F1 on each dataset.
  • Quantitative Results:
| Model | MAE (Culture AR) | Avg F1 (Safety AR) |
|---|---|---|
| Human annotators | 0.80 | |
| FanarGuard-G-4B | 0.40 | 0.82 |
| FanarGuard-G-2B | 0.41 | 0.82 |
| FanarGuard-R | 0.44 | 0.81 |
| PolyGuard-Min (8B) | | 0.83 |
| WildGuard (7B) | | 0.57 |
  • On the Arabic culture benchmark, FanarGuard-G-4B achieves MAE 0.79 (cultural alignment), closely matching human annotator error (0.80). ICC for FanarGuard-G-4B is 0.54 versus 0.64 for humans.
  • On general safety, FanarGuard matches or exceeds the performance of substantially larger moderation models on both English and Arabic, demonstrating marked parameter efficiency.
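The thresholding step used for the binary safety benchmarks can be sketched as a simple sweep over candidate cutoffs; the score grid and toy data below are illustrative:

```python
def f1_score(y_true, y_pred):
    """F1 over boolean lists, where True marks the unsafe class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels, grid=None):
    """Pick the cutoff on the continuous score that maximizes F1.

    Higher FanarGuard scores mean safer content, so scores BELOW the
    threshold are flagged unsafe.
    """
    grid = grid or [s / 10 for s in range(10, 50)]  # 1.0 .. 4.9
    return max(grid, key=lambda t: f1_score(labels, [s < t for s in scores]))

# Toy data: unsafe examples (True) tend to receive low safety scores.
scores = [1.2, 1.8, 2.4, 4.1, 4.6, 4.9]
labels = [True, True, True, False, False, False]
t = best_threshold(scores, labels)
print(t, f1_score(labels, [s < t for s in scores]))
```

In practice the threshold is tuned per benchmark, which is why a single regression model can report dataset-specific F1 numbers.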

5. Analysis: Strengths, Limitations, and Illustrative Scenarios

Strengths:

  • Cultural sensitivity: FanarGuard uniquely flags nuanced cultural breaches, e.g., responses supporting polygamy or contravening dietary law that general filters overlook.
  • Parameter and deployment efficiency: With the G-2B model, FanarGuard reaches SOTA safety on English (F1=0.84) and Arabic (F1=0.82) with roughly one-quarter the parameter count of competing moderation models.
  • Bilingual support: Direct handling of Arabic and English prompts mitigates the pitfalls of automatic translation in moderation workflows.
  • Customizable operation: Continuous-valued outputs per axis facilitate context-dependent thresholding.

Limitations:

  • FanarGuard is trained on single-turn prompt–response pairs; multi-turn dialogue context is unmodeled.
  • Annotation dependency on LLM judges and a limited pool of human raters raises the possibility of unmitigated, system-level bias.
  • Eight-topic scope excludes dialectal or subtler intra-Arabic sociolinguistic variation.

Example Scenarios:

| Prompt | Model Response | Safety Score | Culture Score | Moderation Action |
|---|---|---|---|---|
| Is it OK to eat frog meat? | "Yes – frog meat is considered a delicacy in some cultures." | ~5.0 | ~2.0 | Flags cultural misalignment |
| Can a woman travel alone without permission? | "Yes, in many countries women can travel freely." | ~5.0 | ~1.5 | Blocks as norm-violating |

These outcomes demonstrate FanarGuard’s alignment with regional expectations where Western-aligned models would not raise alerts.
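The dual-axis filtering behavior in these scenarios can be approximated with independent per-axis thresholds; the cutoffs below are illustrative choices, not values from the paper:

```python
def moderate(safety_score, culture_score,
             safety_threshold=3.0, culture_threshold=3.0):
    """Map the two continuous scores to a moderation action.

    The axes are judged independently, so content that is perfectly
    safe can still be flagged or blocked on cultural grounds alone.
    """
    if safety_score < safety_threshold:
        return "block: unsafe"
    if culture_score < 2.0:          # severe cultural violation
        return "block: norm-violating"
    if culture_score < culture_threshold:
        return "flag: cultural misalignment"
    return "allow"

print(moderate(5.0, 2.0))  # frog-meat scenario → flagged
print(moderate(5.0, 1.5))  # unchaperoned-travel scenario → blocked
```

A developer targeting a different audience could simply lower or disable the culture thresholds without retraining, which is the practical payoff of returning continuous scores on both axes.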

6. Deployment, Extension, and Practical Implications

FanarGuard's lightweight and open-source architecture allows integration as either a pre- or post-processing step in LLM-based systems, with threshold sensitivity tunable per use case. Extensions are supported via further fine-tuning on new culturally relevant corpora, including domains such as medicine or law, or expansion to languages sharing sociocultural context with Arabic (e.g., Urdu, Farsi).

Practical deployments can utilize FanarGuard to:

  • Warn users or suppress responses deemed misaligned on either safety or cultural criteria.
  • Enforce compliance with regional content moderation laws.
  • Enhance user trust by offering transparency and rationale for content refusals or warnings.

A plausible implication is that, by setting a precedent for nuanced, multidimensional moderation, FanarGuard provides a template for future context-sensitive safeguards tailored to other non-Western linguistic and cultural environments.

7. Ethical Considerations and Future Directions

Notable ethical issues include the potential for overblocking, particularly of minority viewpoints, and unintentional perpetuation of cultural biases by LLM-based judges or synthetic data. Systemic reliance on a restricted topical scope and singular cultural settings may miss subregional or countercultural norms within Arabic-speaking contexts. To address these limitations, proposed future work includes:

  • Incorporation of dialectal and intra-regional variations.
  • Semi-supervised or human-in-the-loop moderation pipelines for improved adaptability.
  • Extension to multi-turn dialogue and conversational moderation.

By embedding cultural context as a first-class consideration, FanarGuard establishes a new standard for LLM moderation tailored for Arabic users and informs the broader development of inclusive, globally applicable alignment frameworks (Fatehkia et al., 24 Nov 2025).
