
Multi-BERT Ensemble Approach

Updated 26 December 2025
  • The paper demonstrates that integrating domain adaptation with diverse fine-tuning yields up to a 3.13 pp improvement in F1 score.
  • It employs soft voting, hard voting, and weighted aggregation to combine predictions from models trained with varied sequence lengths and configurations.
  • The strategy enhances robustness and generalization in classification tasks, though it requires higher computational resources during inference.

A Multi-BERT Ensemble Approach refers to the strategy of leveraging multiple independently fine-tuned BERT (Bidirectional Encoder Representations from Transformers) models for improved prediction in classification or structured prediction tasks. This paradigm encompasses a set of architectural, procedural, and inference-time mechanisms that systematically combine predictions from several BERT-based models, often with variations in pretraining, architectural instantiations, fine-tuning schemes, or data views. The core insight is to exploit model-level diversity—arising from stochastic optimization, domain adaptation, input transformation, or BERT-variant heterogeneity—to achieve measurable gains in robustness and accuracy, typically via probability aggregation or voting rules at inference.

1. Core Architectural Foundations

The Multi-BERT Ensemble strategy is instantiated by starting from a strong BERT backbone, such as ArabicBERT—a base model with 12 Transformer encoder layers, hidden size 768, and 12 self-attention heads, pretrained on 93 GB of general Arabic web text using the canonical Masked Language Modeling (MLM) and Next-Sentence Prediction objectives (Talafha et al., 2020). Domain adaptation is achieved by additional MLM pretraining on large volumes of in-domain, unlabeled data (e.g., 10M unlabeled tweets for dialect identification) for a specified number of epochs (3 in this instance), which yields an adapted BERT checkpoint (labeled Multi-Dialect Arabic-BERT in the case study).
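
The adaptation step can be sketched with the Hugging Face transformers API. This is a minimal sketch, not the paper's code: the checkpoint name asafaya/bert-base-arabic, the 128-token truncation length, the batch size, and the placeholder corpus are assumptions; only the 3 MLM epochs and the use of unlabeled in-domain tweets come from the case study.

```python
# Sketch: continued MLM pretraining ("domain adaptation") of ArabicBERT on unlabeled
# in-domain text. Checkpoint name, max_length, and batch size are assumptions.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "asafaya/bert-base-arabic"            # assumed Hub id for ArabicBERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

unlabeled_tweets = ["..."]                         # placeholder for the ~10M unlabeled tweets
ds = Dataset.from_dict({"text": unlabeled_tweets})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="multi-dialect-arabic-bert",
                         num_train_epochs=3,       # 3 MLM epochs, as in the case study
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
model.save_pretrained("multi-dialect-arabic-bert")       # the adapted checkpoint
tokenizer.save_pretrained("multi-dialect-arabic-bert")
```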

Models in the ensemble are then fine-tuned in parallel, each initialized from the adapted checkpoint. The fine-tuning is conducted on identical labeled data but may differ along hyperparameters (e.g., sequence length {80, 90, 100, 250}), random seeds, or input formats. Each model is paired with a classification head applied to the pooled [CLS] representation, subjected to dropout, and finally mapped through a dense layer and softmax for multi-class tasks.
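
A minimal PyTorch sketch of the per-member classifier just described (pooled [CLS] representation, dropout, dense layer, softmax); the class name and constructor arguments are illustrative rather than taken from the paper.

```python
# Sketch of one ensemble member: adapted BERT encoder + dropout + dense + softmax head.
import torch
import torch.nn as nn
from transformers import AutoModel

class BertClassifier(nn.Module):
    def __init__(self, checkpoint: str, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)        # adapted BERT checkpoint
        self.dropout = nn.Dropout(dropout)                       # dropout after pooled [CLS]
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output                               # pooled [CLS] representation
        logits = self.classifier(self.dropout(pooled))
        return torch.softmax(logits, dim=-1)                     # class probabilities for soft voting
```

During training, cross-entropy over the logits would normally be used; probabilities are returned here because the soft-voting stage described next consumes them directly.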

2. Ensembling Methods and Aggregation Formulations

Multi-BERT ensemble methods fall into several classes, all supported in the literature (Talafha et al., 2020, Mittal et al., 2021, Adams et al., 2021):

  • Soft voting (probability averaging): Each model outputs a probability vector $p_i(c|x)$ for class $c$ on input $x$. The ensemble prediction is given by:

$$\hat{y}(x) = \arg\max_{c} \frac{1}{N}\sum_{i=1}^{N} p_i(c|x)$$

where $N$ is the number of ensemble members.

  • Majority (hard) voting: Each model predicts a hard label; the final output is the most frequent label across models.
  • Weighted aggregation: Non-uniform weights $w_i$ (possibly derived from validation performance) are applied:

$$\hat{y}(x) = \arg\max_{c} \sum_{i=1}^{N} w_i \, p_i(c|x)$$

with $\sum_i w_i = 1$.

  • Stacked generalization: Concatenate the models’ predictions or representations and train a meta-learner (e.g., linear regression, neural net) to produce the final output (Mnassri et al., 2022, Krishnan, 2023).

In (Talafha et al., 2020), the element-wise averaging rule with uniform weights ($w_i = 1/4$) over four fine-tuned model instances is used, followed by an argmax to produce the predicted class.
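
The soft, hard, and weighted voting rules above reduce to a few lines of NumPy. The sketch below assumes `probs` stacks the members' softmax outputs with shape (N, num_examples, num_classes); it is an illustration, not code from any of the cited papers.

```python
import numpy as np

def soft_vote(probs):
    """Uniform probability averaging, as used in (Talafha et al., 2020)."""
    return probs.mean(axis=0).argmax(axis=-1)

def hard_vote(probs):
    """Majority vote over each member's argmax label."""
    labels = probs.argmax(axis=-1)                       # shape (N, num_examples)
    n_classes = probs.shape[-1]
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, labels)
    return counts.argmax(axis=0)

def weighted_vote(probs, weights):
    """Weighted aggregation; `weights` should sum to 1 (e.g., from validation F1)."""
    w = np.asarray(weights)[:, None, None]
    return (w * probs).sum(axis=0).argmax(axis=-1)
```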

3. Fine-Tuning Procedures and Regularization

Ensemble diversity is generated by varying the fine-tuning configuration. In the NADI setup (Talafha et al., 2020), each ensemble member is fine-tuned for 3 epochs with identical learning rate ($3.75 \times 10^{-5}$, Adam), batch size 16, and the ArabicBERT WordPiece tokenizer, but with four distinct maximum sequence lengths. All gradients are backpropagated through the entire network, and dropout (typically $p = 0.1$) is applied after the [CLS] embedding. This ensures that each model is exposed to different effective context windows, introduces randomness through SGD dynamics, and benefits from complementary representations.

Uniform hyperparameters across ensemble members preserve the validity of aggregation, while sequence-length variation and random initialization enhance model-level variance.
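
A sketch of how this diversification can be set up, assuming the adapted checkpoint from Section 1 was saved locally as multi-dialect-arabic-bert; the number of labels and the omitted training loop are task-specific placeholders.

```python
# Sketch: one ensemble member per maximum sequence length, all initialized from the same
# adapted checkpoint and trained with identical optimization settings.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "multi-dialect-arabic-bert"     # assumed local path of the adapted checkpoint
SEQ_LENS = [80, 90, 100, 250]                # per-member maximum sequence lengths
NUM_LABELS = 21                              # placeholder: number of target classes

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
members = []
for max_len in SEQ_LENS:
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=NUM_LABELS)
    optimizer = torch.optim.Adam(model.parameters(), lr=3.75e-5)
    # ... fine-tune for 3 epochs with batch size 16, inputs truncated/padded to max_len ...
    members.append((model, max_len))
```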

4. Empirical Impact and Ablation Findings

Table: Macro-averaged F1 on NADI development set (Talafha et al., 2020)

| Model | F1 (%) |
|---|---|
| ArabicBERT (no domain adaptation), single | 24.45 |
| ArabicBERT ensemble (4 sequence lengths) | 24.92 |
| Multi-Dialect Arabic-BERT (domain-adapted), single | 26.00 |
| Multi-Dialect Arabic-BERT ensemble | 27.58 |

The ablation described in (Talafha et al., 2020) shows:

  • Domain-adaptive pretraining yields a +1.55 pp gain (24.45 → 26.00).
  • Ensembling alone over base ArabicBERT yields +0.47 pp.
  • Combination (domain adaptation + ensembling) results in a total gain of +3.13 pp.
  • Test set (micro-F1) best ensemble: 26.78%.

The data support that while both domain adaptation and ensembling contribute independently to improved metrics, their combination is supra-additive: the individual gains sum to 2.02 pp, below the 3.13 pp achieved jointly. A plausible implication is that model-level error patterns are partially independent; thus, averaging reduces variance and exploits complementary generalization.

5. Computational Costs and Practical Recommendations

Ensembling with $N$ BERT models increases inference latency and GPU memory by a factor of $N$, as each member requires an independent forward pass (Talafha et al., 2020). This is especially salient for $N = 4$ with batch size 16 and large input lengths. Parallelization is recommended for latency-critical deployments.

To generalize the method to other languages or domains:

  1. Start from a suitable pretrained BERT or mBERT for the target domain.
  2. Further pretrain on in-domain unlabeled data for several epochs with MLM.
  3. Fine-tune multiple runs, varying sequence length or other fine-tuning hyperparameters.
  4. Aggregate predictions by averaging softmax probabilities.

Inference batches can be processed sequentially or in parallel to mitigate GPU limitations. The method is particularly suited when data or pretraining corpora are heterogeneous or where domain adaptation is key, as in dialect identification.
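
A hypothetical inference helper tying the recipe together, assuming the `members` list of (model, max_len) pairs and the tokenizer from the sketch in Section 3; members are run sequentially here, but the loop can be distributed across devices for latency-critical use.

```python
import torch

@torch.no_grad()
def ensemble_predict(texts, members, tokenizer):
    """Soft voting with uniform weights: average the members' softmax outputs, then argmax."""
    probs = []
    for model, max_len in members:
        model.eval()
        enc = tokenizer(texts, truncation=True, padding="max_length",
                        max_length=max_len, return_tensors="pt")
        logits = model(**enc).logits
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)
```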

6. Limitations and Generality

While Multi-BERT ensembles improve F1 and can outperform advanced single-model baselines, the gains are bounded by model complementarity. The approach multiplies inference time and memory by $N$, restricting application in resource-constrained environments. No advanced weighting or calibration was used in (Talafha et al., 2020); all weights were uniform, reflecting that in this context further tuning yielded no substantial improvements.

Overfitting risk remains low due to the decorrelation of error patterns among models driven by domain adaptation and hyperparameter heterogeneity; however, very large $N$ may yield diminishing returns, suggesting that the optimal $N$ is modest. The approach scales naturally to other sequence classification problems, especially in under-resourced language domains, with similar regularization/batching caveats.

7. Relationship to Broader Ensemble Practice

The Multi-BERT Ensemble approach aligns structurally with classic ensemble learning (bagging, boosting, stacking) but adapts its principles to deep transformer architectures. Unlike boosting, which updates weights between rounds (Huang et al., 2020), or stacking with a meta-learner (Mnassri et al., 2022), the method in (Talafha et al., 2020) is based on uniformly averaging outputs from models trained on fixed data and varied configurations. The broader ensemble paradigm corroborates that aggregation of independently trained models delivers consistent, though computationally expensive, accuracy gains in diverse real-world NLP tasks.
