PhoBERT-CNN Hybrid
- The paper introduces a hybrid model that integrates PhoBERT with a multi-channel Text-CNN to effectively capture both contextual and localized lexical features for Vietnamese hate speech detection.
- It employs a rigorous two-phase data cleaning and Easy Data Augmentation strategy to address orthographic noise and class imbalance, significantly boosting macro-F1 scores.
- Empirical evaluations on ViHSD and HSD-VLSP benchmarks, along with real-time streaming deployment, demonstrate its superior performance and practical applicability in content moderation.
The PhoBERT-CNN Hybrid is a composite neural architecture that leverages the monolingual Vietnamese transformer model PhoBERT as an embedding encoder, paired with a multi-channel convolutional (Text-CNN) classification head. This approach is designed for high-accuracy hate speech detection (HSD) in the Vietnamese language, addressing distinctive challenges found in social media text such as orthographic noise, morphological complexity, and severe class imbalance. The architecture achieves state-of-the-art macro-averaged F1-scores on leading Vietnamese HSD benchmarks and is deployable in real-time streaming pipelines (Tran et al., 2022).
1. Hybrid Model Architecture
The PhoBERT-CNN pipeline is structured as follows: given a pre-processed Vietnamese token sequence $X = (x_1, x_2, \dots, x_n)$, the PhoBERT transformer generates contextualized word embeddings $E = (e_1, e_2, \dots, e_n) \in \mathbb{R}^{n \times d}$, where $d$ is the hidden size (e.g., $d = 768$ for PhoBERT-base). These embeddings are passed to parallel 1D convolution channels (with different kernel sizes, each with 32 filters). For a channel with kernel size $k$, filters $W \in \mathbb{R}^{k \times d}$ and biases $b$ produce feature maps via
$$c_i = \mathrm{ReLU}\!\left(W \cdot E_{i:i+k-1} + b\right), \qquad i = 1, \dots, n-k+1.$$
Max-pooling over time yields one pooled feature per filter, $\hat{c} = \max_i c_i$; the pooled vectors are concatenated across channels as $z = [\hat{c}^{(1)}; \hat{c}^{(2)}; \dots]$, then mapped by a fully connected softmax layer to produce the label distribution $\hat{y} = \mathrm{softmax}(W_o z + b_o)$ over $K = 3$ classes (CLEAN, OFFENSIVE, HATE). The loss is cross-entropy,
$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log \hat{y}_k.$$
This architecture exploits PhoBERT’s monolingual vocabulary and transformer context encoding, while the CNN layer distills localized lexical features relevant to HSD (Tran et al., 2022).
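The PyTorch sketch below illustrates this hybrid: a PhoBERT encoder feeding parallel `Conv1d` channels with max-over-time pooling and a linear softmax head. The `vinai/phobert-base` checkpoint name is the standard Hugging Face identifier, but the kernel sizes shown are illustrative assumptions rather than the exact configuration reported by Tran et al. (2022).

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PhoBertCNN(nn.Module):
    """Sketch of a PhoBERT encoder with a multi-channel Text-CNN head."""

    def __init__(self, kernel_sizes=(2, 3, 4), n_filters=32, n_classes=3,
                 dropout=0.4, phobert_name="vinai/phobert-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(phobert_name)
        hidden = self.encoder.config.hidden_size  # 768 for PhoBERT-base
        # One 1D convolution channel per kernel size, 32 filters each.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings: (batch, seq_len, hidden)
        emb = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        emb = emb.transpose(1, 2)                        # (batch, hidden, seq_len)
        pooled = [torch.relu(conv(emb)).max(dim=2).values  # max over time
                  for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.classifier(features)                 # logits: CLEAN/OFFENSIVE/HATE
```

Training would minimize `nn.CrossEntropyLoss` on the returned logits, matching the cross-entropy objective above.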
2. Data Pre-processing and Augmentation
A two-stage cleaning pipeline is employed to address the orthographic and lexical noise prevalent in Vietnamese social media:
- Phase 1 (character-level): Lowercasing, redundant whitespace/hyperlink removal, Unicode normalization (to UTF-8 NFC), deletion of repeated characters, and normalization of Vietnamese diacritic placement.
- Phase 2 (token-level): Vietnamese word segmentation (VnCoreNLP for PhoBERT-based models), mapping of teencode (slang/abbreviation) to standard forms, and stopword removal using a published Vietnamese stopword list (a minimal sketch of both phases follows this list).
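A minimal Python sketch of the two cleaning phases, using illustrative regular expressions; `TEENCODE_MAP` and `STOPWORDS` are hypothetical stand-ins for the published teencode and stopword resources, and VnCoreNLP word segmentation is assumed to run separately before Phase 2.

```python
import re
import unicodedata

# Hypothetical teencode/slang dictionary and stopword list; the actual
# resources used by Tran et al. (2022) are external published lists.
TEENCODE_MAP = {"ko": "không", "dc": "được"}
STOPWORDS = {"là", "của"}

def clean_phase1(text: str) -> str:
    """Character-level cleaning: lowercase, strip hyperlinks and extra
    whitespace, NFC-normalize Unicode, and collapse repeated characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)       # remove hyperlinks
    text = unicodedata.normalize("NFC", text)        # canonical diacritic form
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # "đẹpppp" -> "đẹp"
    return re.sub(r"\s+", " ", text).strip()

def clean_phase2(tokens: list[str]) -> list[str]:
    """Token-level cleaning: map teencode to standard forms, drop stopwords.
    Word segmentation (e.g., VnCoreNLP) is assumed to have run already."""
    tokens = [TEENCODE_MAP.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS]
```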
To address class imbalance (e.g., CLEAN ≈ 83%, OFFENSIVE ≈ 7%, HATE ≈ 10% in ViHSD), Easy Data Augmentation (EDA) techniques (Wei & Zou, 2019) are applied to minority-class examples: synonym substitution, random insertion, random swap, and random deletion, each at a fixed replacement ratio. Augmentation is performed only for the minority classes to balance the training set, and all models are subsequently retrained on this balanced, enriched corpus (Tran et al., 2022).
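The EDA operations themselves are lightweight token-level perturbations. The sketch below illustrates them under the assumption of a small, hypothetical Vietnamese synonym lookup; the actual synonym resource and replacement ratio used in the paper are not reproduced here.

```python
import random

SYNONYMS = {"xấu": ["tệ", "dở"]}  # hypothetical Vietnamese synonym lookup

def eda_augment(tokens, alpha=0.1):
    """Apply one randomly chosen EDA operation (Wei & Zou, 2019) to a
    minority-class example; `alpha` controls how many tokens are touched."""
    tokens = list(tokens)
    if not tokens:
        return tokens
    n_changes = max(1, int(alpha * len(tokens)))
    op = random.choice(["synonym", "insert", "swap", "delete"])
    for _ in range(n_changes):
        if op == "synonym":
            idx = random.randrange(len(tokens))
            tokens[idx] = random.choice(SYNONYMS.get(tokens[idx], [tokens[idx]]))
        elif op == "insert":
            word = random.choice(tokens)
            tokens.insert(random.randrange(len(tokens) + 1),
                          random.choice(SYNONYMS.get(word, [word])))
        elif op == "swap":
            i, j = random.randrange(len(tokens)), random.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]
        elif op == "delete" and len(tokens) > 1:
            tokens.pop(random.randrange(len(tokens)))
    return tokens
```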
3. Training Setup and Hyperparameters
PhoBERT-CNN employs:
- Batch size: $64$
- Learning rate:
- Optimizer: AdamW (linear warmup)
- Weight decay: $0.01$
- Epochs: $3$ (ViHSD); 5-fold cross-validation (HSD-VLSP)
- Dropout (CNN layer): $0.4$
- Convolutional channels: parallel channels with different kernel sizes, 32 filters each
- No additional class weighting is used due to prior data augmentation
Baseline models include Naive Bayes (NB), SVM, and Random Forest (RF) classifiers on TF-IDF features with hyperparameter tuning, plus deep networks (Text-CNN, Bi-LSTM with fastText/PhoW2V initialization). Non-transformer architectures are trained for up to 10 epochs with batch size $64$, dropout $0.4$, and early stopping (Tran et al., 2022).
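A sketch of the fine-tuning setup implied by these hyperparameters, using the standard `transformers` linear-warmup scheduler; the learning rate and warmup fraction shown are placeholder assumptions, not values reported in the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, train_loader, epochs=3, lr=2e-5,
                    weight_decay=0.01, warmup_fraction=0.1):
    """AdamW with linear warmup/decay; lr and warmup_fraction are placeholders."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_fraction * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```

Batch size $64$ and dropout $0.4$ from the list above would be applied in the DataLoader and the CNN head, respectively.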
4. Empirical Evaluation
The PhoBERT-CNN model is benchmarked on ViHSD (33,400 comments, 7:1:2 split) and HSD-VLSP (20,345 comments). Both show severe class imbalance. Evaluation metrics are macro-F1 and accuracy. Results:
| Model | ViHSD macro-F1 (%) | ViHSD Acc. (%) | HSD-VLSP macro-F1 (%) | HSD-VLSP Acc. (%) |
|---|---|---|---|---|
| Text-CNN+fastText | 61.67 | 86.98 | 85.76 | 97.14 |
| Bi-LSTM+PhoW2V | 62.66 | 85.99 | 84.04 | 96.79 |
| BERT-large | 60.29 | 84.52 | 85.41 | 96.19 |
| RoBERTa-large | 61.49 | 83.04 | 85.79 | 96.95 |
| XLM-R-large | 62.38 | 83.62 | 86.57 | 97.15 |
| PhoBERT-large | 63.51 | 87.13 | 86.68 | 97.58 |
| BERT-CNN | 61.26 | 85.90 | 86.37 | 96.17 |
| RoBERTa-CNN | 62.47 | 84.54 | 86.48 | 96.38 |
| XLM-R-CNN | 63.34 | 85.48 | 88.53 | 96.92 |
| PhoBERT-CNN | 67.46 | 87.76 | 98.45 | 98.59 |
Ablation confirms PhoBERT yields ≈3% macro-F1 gain over multilingual baselines, the CNN layer further adds 1–4%, EDA augmentation provides 3–8%, and two-phase cleaning yields 4–7% improvement relative to simpler normalizations. Error analysis indicates most misclassifications occur on ambiguous or sarcastic content (Tran et al., 2022).
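Because macro-F1 averages per-class F1 scores with equal weight, minority-class (OFFENSIVE, HATE) performance is not swamped by the CLEAN majority. A minimal scikit-learn sketch with made-up labels shows how both reported metrics are computed:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels only: 0 = CLEAN, 1 = OFFENSIVE, 2 = HATE
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 0, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("macro-F1 :", f1_score(y_true, y_pred, average="macro"))
```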
5. Real-Time Streaming Application
PhoBERT-CNN is operationalized within an Apache Spark Structured Streaming pipeline (Spark 3.1.1). The system ingests live YouTube comments via TCP, applies the two-phase cleaning, processes text through the PhoBERT-CNN model, and writes predictions to Parquet storage for monitoring. A web interface (SparkSQL + REST) enables real-time visualization. On hardware comprising an Intel Core i7 CPU, 16 GB RAM, and an NVIDIA GTX 1650 Ti GPU, the pipeline achieves a throughput of 0.64 comments/sec (1.56 s per comment), with accuracy of 82.02% and macro-F1 of 58.19% on 500 annotated streaming samples. Inference latency is dominated by PhoBERT processing; reducing this bottleneck requires model distillation or batching (Tran et al., 2022).
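A minimal PySpark Structured Streaming sketch of this flow (TCP socket source → cleaning/classification UDF → Parquet sink); the host, port, output paths, and the `classify_comment` placeholder are hypothetical stand-ins for the deployed components.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("phobert-cnn-hsd").getOrCreate()

def classify_comment(text):
    """Placeholder: run two-phase cleaning, then PhoBERT-CNN inference,
    returning one of CLEAN / OFFENSIVE / HATE."""
    return "CLEAN"  # replace with real preprocessing + model call

classify_udf = udf(classify_comment, StringType())

# Ingest live comments over TCP (hypothetical host/port).
comments = (spark.readStream
            .format("socket")
            .option("host", "localhost")
            .option("port", 9999)
            .load())  # streaming DataFrame with a single 'value' column

predictions = comments.withColumn("label", classify_udf(comments["value"]))

# Persist predictions to Parquet for downstream monitoring/visualization.
query = (predictions.writeStream
         .format("parquet")
         .option("path", "/tmp/hsd_predictions")
         .option("checkpointLocation", "/tmp/hsd_checkpoint")
         .start())
query.awaitTermination()
```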
6. Strengths, Limitations, and Future Work
The model's monolingual transformer encoder captures Vietnamese morphological and idiomatic nuances more faithfully than multilingual alternatives. The CNN classification head efficiently extracts local lexical patterns, and the strict two-phase cleaning pipeline combined with minority-only data augmentation mitigates class imbalance without costly additional annotation.
Several limitations remain. PhoBERT-CNN's error profile is dominated by failures on sarcastic or highly context-dependent speech, and the architecture does not explicitly model discourse context. Latency (~1.5 s per comment) may be excessive for time-sensitive applications, motivating exploration of distilled models (e.g., DistilPhoBERT). Explainability is limited, with potential for incorporating attention-based rationale-extraction frameworks such as HateXplain. Extension to span-level detection and multi-aspect annotation (e.g., target specificity, severity) remains an open area. Finally, additional streaming benchmarks, including high-throughput and fault-tolerance tests, would inform scalability for production deployment (Tran et al., 2022).
In summary, the PhoBERT-CNN hybrid, augmented by rigorous Vietnamese-specific pre-processing and targeted data augmentation, establishes new state-of-the-art performance for Vietnamese hate speech detection, and demonstrates feasibility for real-time content moderation while highlighting avenues for improved efficiency and interpretability (Tran et al., 2022).