BGE-M3 Classifier: Ethical Multilingual Filtering
- The paper introduces BGE-M3, a Transformer-based classifier that leverages a multilingual, fairness-aware approach to proactively screen harmful prompts.
- It utilizes tailored architectural initialization, balanced batching, and Class-Balanced Focal Loss with a dual-language dataset to optimize performance.
- Empirical evaluation shows an F1-score of approximately 0.81, demonstrating robust cross-lingual detection of harmful content in both English and Vietnamese.
The BGE-M3 classifier is a fine-tuned, multilingual Transformer model for proactive textual prompt screening at the input stage of generative text-to-image pipelines. Developed as the core filtering module within the SafeGen framework, BGE-M3 identifies and blocks prompts likely to elicit harmful, biased, or misleading imagery, with demonstrated robustness across English and Vietnamese. Its performance derives from tailored architectural initialization, a curated dual-language dataset, and a novel fairness-aware optimization scheme that exploits class-balanced sampling in tandem with Class-Balanced Focal Loss. Empirical evaluation establishes that BGE-M3 delivers reliable cross-lingual harmful content detection, ensuring ethical compliance in generative image workflows (Nam et al., 14 Dec 2025).
1. Model Architecture and Initialization
BGE-M3 employs a Transformer encoder backbone with 12 self-attention layers, a hidden size of 768, and 12 attention heads, with approximately 110 million parameters. Initial weights are sourced from a heterogeneous mix of BERT, RoBERTa, and XLM-RoBERTa checkpoints to optimize cross-lingual representation capacity.
Textual inputs are tokenized using a shared Byte Pair Encoding vocabulary (≈50,000 tokens), and embedded as 768-dimensional vectors. Standard positional encodings consistent with the "Attention Is All You Need" framework are applied.
A single-layer classification head, positioned atop the pooled [CLS] token representation, outputs a two-dimensional logit vector. Softmax activation produces class probabilities for subsequent decision logic.
2. Multilingual Data Curation and Labeling
Training and evaluation utilize a multilingual dataset comprising approximately 830,000 samples with the following composition:
- Vietnamese fake-news articles and toxic comments (VFND), alongside Vietnamese legal ("clean") documents.
- English toxic comment and biased news data.
A binary label, , is assigned to each example, with denoting "safe" and representing "harmful/misleading" prompts. The latter class encompasses requests for biased portrayals, hate speech, disinformation, or prompts likely to generate non-consensual or illicit imagery.
The dataset exhibits a notable class imbalance (730,000 safe vs. 100,000 harmful), yielding a 9:1 ratio.
3. Fairness-Aware Optimization and Training Procedures
To address class imbalance and promote model fairness, two methodological elements are employed:
- Balanced Batching: Each mini-batch contains equal numbers of safe and harmful instances, normalizing gradient estimates and counteracting domination by the majority class.
- Class-Balanced Focal Loss: For batch size and instance , define as the ground-truth label and as the softmax probability for the harmful class. The loss is:
where is inversely proportional to class frequency, focusing learning on harmful samples, and is the focusing parameter to de-emphasize well-classified instances.
No adversarial or demographic-specific penalty terms were employed. The balanced-batch sampling and focal loss were sufficient to mitigate skew and enhance minority-class sensitivity.
4. Inference, Thresholding, and Post-Processing
At inference, BGE-M3 computes for the input prompt. The acceptance rule is:
- If , the prompt is rejected and accompanied by an explanatory return based on the most probable detected harmful category.
- If , the prompt passes to the generative model (Hyper-SD) for downstream synthesis.
No ensembling or post-calibration was necessary, as validation-stage probabilities exhibited strong interclass separation.
5. Evaluation Methodology and Quantitative Performance
Evaluation proceeded on a stratified 10% test split from the full corpus. Standard classification metrics were employed:
with TP, FP, and FN computed for the harmful class.
Results obtained:
- Accuracy: 0.8215
- F1-score: 0.8145
- English subset F1: 0.816
- Vietnamese subset F1: 0.813
These results indicate strong cross-lingual robustness. Slightly elevated false negative rates on culturally-specific harm categories highlight the current taxonomy's limitations.
6. Comparative Ablation Studies
Ablation and comparative studies isolate the impact of model and training choices:
| Model Configuration | Fine-Tuned | F1 Score |
|---|---|---|
| BGE-M3 (base, no fine-tuning) | No | 0.1840 |
| PhoBERT-base-v2 (fine-tuned) | Yes | 0.6862 |
| keepitreal/vietnamese-sbert (fine-tuned) | Yes | 0.1587 |
| xlm-roberta-large (fine-tuned) | Yes | 0.0932 |
Domain-specific fine-tuning yields the greatest performance delta (F1 over the pretrained base). Balanced batching and focal weighting each provide an incremental F1 gain of approximately . Changes to backbone architecture have comparatively marginal effect, underscoring the dominant role of curated multilingual data and fairness-aware optimization.
7. Significance and Limitations
BGE-M3 demonstrates the practical viability of lightweight, fairness-aware Transformer-based prompt classifiers for pre-generation ethical content filtering in multilingual generative workflows. It achieves reliable interception of harmful textual inputs (F1 0.81) without sacrificing language or category robustness.
Unaddressed in this iteration are explicit group-based fairness metrics and comprehensive coverage of culturally-nuanced harmful prompt categories. A plausible implication is that extending label taxonomies and incorporating demographic-aware supervision could further enhance harm detection, particularly for rare or context-dependent categories (Nam et al., 14 Dec 2025).