Mixed-Distil-BERT: Efficient Compact Transformer Models

Updated 10 December 2025
  • Mixed-Distil-BERT is a family of compact, parameter-efficient models that use mixed-data, mixed-vocabulary, and mixed-precision techniques to support code-mixed and multilingual tasks.
  • It applies a two-phase pre-training strategy—first on trilingual data then on synthetic code-mixed corpora—to achieve competitive performance with low resource requirements.
  • Mixed-precision quantization and embedding alignment significantly reduce model size and inference latency, enabling deployment in resource-constrained environments.

Mixed-Distil-BERT refers to a family of compact BERT-based models constructed via mixed-data, mixed-vocabulary, and mixed-precision training protocols, designed to address either code-mixed language understanding or efficient deployment on resource-constrained systems. These models maintain the core transformer architecture of DistilBERT but introduce specific innovations in pre-training, vocabulary alignment, and quantization, yielding highly parameter-efficient alternatives for multilingual and code-mixed tasks, as well as general language understanding within memory and latency constraints.

1. Model Architectures and Parameterization

In all documented instantiations, Mixed-Distil-BERT inherits the DistilBERT baseline: six encoder layers, each with twelve self-attention heads, hidden size 768, intermediate feed-forward dimension 3,072, and GELU nonlinearities. This configuration yields approximately 66 million parameters—one third of the mBERT model (∼177 M) and one-tenth of XLM-R (∼550 M) (Raihan et al., 2023).
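As a concrete reference point, the baseline backbone can be instantiated with the Hugging Face transformers library; this is a minimal sketch to verify the stated dimensions and parameter count, not the authors' training code:

```python
# Sketch: build the DistilBERT-sized backbone described above and check that it
# lands near 66 M parameters. Requires the `transformers` and `torch` packages.
from transformers import DistilBertConfig, DistilBertModel

config = DistilBertConfig(
    vocab_size=30522,      # standard DistilBERT WordPiece vocabulary
    n_layers=6,            # six encoder layers
    n_heads=12,            # twelve self-attention heads per layer
    dim=768,               # hidden size
    hidden_dim=3072,       # intermediate feed-forward dimension
    activation="gelu",     # GELU nonlinearity
)
model = DistilBertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f} M parameters")  # ~66 M
```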

For Mixed-Vocabulary Training targeting ultra-small models, the architecture is further reduced to six or twelve transformer layers with hidden size 256, four attention heads, and a substantially pruned vocabulary (≈5,000 tokens), shrinking the parameter-heavy input embedding matrix and bringing the total size as low as 6.2 M parameters (Zhao et al., 2019).
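The ultra-small variant can be approximated with a standard BERT configuration; the feed-forward dimension is not stated in the source, so the 4× expansion (1,024) below is an assumption. The split between embedding and encoder parameters also makes the embedding-bottleneck argument explicit:

```python
# Sketch of the ~6 M-parameter student described above. The intermediate size
# (1024 = 4 x 256) is an assumption, not stated in the source.
from transformers import BertConfig, BertModel

tiny_config = BertConfig(
    vocab_size=5000,          # pruned student vocabulary
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,   # assumed 4x feed-forward expansion
)
tiny = BertModel(tiny_config)

emb = sum(p.numel() for p in tiny.embeddings.parameters())
enc = sum(p.numel() for p in tiny.encoder.parameters())
print(f"embeddings: {emb/1e6:.2f} M, encoder: {enc/1e6:.2f} M, "
      f"total: {sum(p.numel() for p in tiny.parameters())/1e6:.2f} M")  # ~6 M total
```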

Mixed-precision quantization protocols further compress DistilBERT by partitioning each weight matrix into granular subgroups and searching over heterogeneous bit-assignments (0, 2, 4, or 8 bits), yielding storage footprints of 10 MB—up to 16× smaller than float DistilBERT with only minor drops in benchmark performance (Zhao et al., 2021).

2. Pre-training Procedures and Data Synthesis

Mixed-Distil-BERT for code-mixed tasks employs a two-phase pre-training strategy (Raihan et al., 2023):

  1. Trilingual Pre-training (Tri-Distil-BERT): Starting from English-only DistilBERT, additional MLM pre-training is conducted over Bangla and Hindi subsets of the OSCAR corpus, each contributing ∼100 million monolingual sentences.
  2. Code-mixed Pre-training: The resulting Tri-Distil-BERT checkpoint is further trained on a synthetic corpus of 560,000 English–Bangla–Hindi sentences generated via the random code-mixing algorithm of Krishnan et al. (2021), which recombines translated spans from parallel Yelp Polarity sentences to produce realistic multi-script utterances.
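The exact algorithm of Krishnan et al. (2021) is not reproduced in the source; the following is a deliberately simplified sketch of random code-mixing over word-aligned parallel sentences, with all function names, the switch probability, and the transliterated example tokens being illustrative assumptions:

```python
import random

def random_code_mix(parallel, switch_prob=0.3, seed=0):
    """Simplified illustration of random code-mixing (not the exact algorithm).

    `parallel` is a list of word-aligned translations of the same sentence,
    e.g. [english_tokens, bangla_tokens, hindi_tokens], all of equal length.
    Each position keeps the English token or switches language with
    probability `switch_prob`.
    """
    rng = random.Random(seed)
    base, *others = parallel
    mixed = []
    for i, tok in enumerate(base):
        if others and rng.random() < switch_prob:
            mixed.append(rng.choice(others)[i])  # switch language at this position
        else:
            mixed.append(tok)
    return " ".join(mixed)

# Hypothetical word-aligned Yelp-style example (rough transliterations, for
# illustration only):
print(random_code_mix([
    ["the", "food",   "was", "great"],
    ["the", "khabar", "was", "darun"],    # Bangla-flavoured variant
    ["the", "khana",  "was", "badhiya"],  # Hindi-flavoured variant
]))
```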

Pre-training throughout employs the masked language modeling objective $\mathcal{L}_{MLM} = -\sum_{i \in M} \log P\bigl(x_i \mid X_{\setminus M}\bigr)$ with a masking probability of 15%. No next-sentence prediction term is used. Key hyperparameters include batch size 16, AdamW optimizer (β₁ = 0.9, β₂ = 0.999, weight decay = 0.01), learning rate 5×10⁻⁵, and sequence length 512. The trilingual phase takes ∼18 hours and the code-mixed phase ∼6 hours on NVIDIA A100 GPUs; final perplexity after five epochs is 2.12.
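A minimal sketch of one MLM training step with these settings, using the standard Hugging Face masking collator; the public English DistilBERT checkpoint and the toy code-mixed sentences are stand-ins, since the source does not publish the training script:

```python
# Sketch: one MLM step with 15% masking, matching
# L_MLM = -sum_{i in M} log P(x_i | X_\M), on a toy batch.
import torch
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          DistilBertForMaskedLM)

ckpt = "distilbert-base-uncased"  # stand-in for the English-only starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = DistilBertForMaskedLM.from_pretrained(ckpt)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
encodings = tokenizer(["ami khub happy aaj", "yeh khana is so good"],  # illustrative
                      truncation=True, max_length=512)
batch = collator([{k: v[i] for k, v in encodings.items()} for i in range(2)])

loss = model(**batch).loss  # cross-entropy over masked positions only
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5,
                              betas=(0.9, 0.999), weight_decay=0.01)
loss.backward()
optimizer.step()
```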

3. Compression via Mixed-Vocabulary Alignment

The embedding bottleneck in BERT derivatives is directly targeted in Mixed-Vocabulary Training (Zhao et al., 2019). Because traditional knowledge distillation presumes identical vocabularies between teacher and student, the method introduces two key stages:

  • Stage 1 (Embedding Alignment): During teacher training, each input word is tokenized with either the teacher (∼30K tokens) or the student (∼5K tokens) vocabulary, chosen with probability p_SV = 0.5. The MLM loss is computed on both, facilitated by a learned projection mapping student embeddings to the teacher dimension. This aligns subword semantics across vocabularies (see the sketch after this list).
  • Stage 2 (Student-only Fine-tuning): Teacher weights and vocabularies are discarded; the student model is initialized with the aligned embedding matrix and further tuned only on student vocabulary tokens.
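Stage 1 can be sketched as follows; the module, parameter names, and toy subword ids are hypothetical, and the teacher encoder itself is abstracted away:

```python
# Sketch of Stage 1 embedding alignment: each word is tokenized with either the
# teacher (~30K) or student (~5K) vocabulary with probability p_SV = 0.5, and
# student embeddings are projected up to the teacher hidden size so that both
# paths can feed the same teacher encoder during MLM training.
import random
import torch
import torch.nn as nn

class MixedVocabEmbedding(nn.Module):
    def __init__(self, teacher_vocab=30000, student_vocab=5000,
                 teacher_dim=768, student_dim=256, p_sv=0.5):
        super().__init__()
        self.p_sv = p_sv
        self.teacher_emb = nn.Embedding(teacher_vocab, teacher_dim)
        self.student_emb = nn.Embedding(student_vocab, student_dim)
        self.proj = nn.Linear(student_dim, teacher_dim)  # learned alignment projection

    def embed_word(self, teacher_ids, student_ids):
        """teacher_ids / student_ids: subword ids of the SAME word under each vocab."""
        if random.random() < self.p_sv:                      # use student vocabulary
            return self.proj(self.student_emb(student_ids))  # project to teacher dim
        return self.teacher_emb(teacher_ids)                 # use teacher vocabulary

# Usage: per-word embeddings (all in teacher dimension) are concatenated along the
# sequence axis and passed through the teacher encoder for the MLM loss.
layer = MixedVocabEmbedding()
word_vec = layer.embed_word(torch.tensor([1012, 2054]), torch.tensor([87]))
print(word_vec.shape)  # (num_subwords, 768)
```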

This approach yields student models with dramatic reductions in parameter count and disk size, retaining performance on GLUE and SNIPS benchmarks close to larger distilled models.

| Model | Params | GLUE Avg | Disk Size |
|---|---|---|---|
| BERT-LARGE | 340 M | 90.3 | — |
| DistilBERT₄ | 52 M | 84.2 | — |
| Mixed-Distil-BERT | 6.2 M | 84.3 | ∼20 MB |

4. Mixed-Precision Quantization and Pruning

Automatic mixed-precision quantization search leverages Differentiable Neural Architecture Search (DNAS) to group each DistilBERT weight matrix or attention projection into subgroups subject to learnable categorical distributions over bit-width assignments, including 0-bit (i.e., pruning) (Zhao et al., 2021). Key computational steps include:

  • Quantization function: $Q(X; s, b) = \mathrm{round}(X \cdot s)/s$ with an adaptive scale s per group (see the sketch after this list).
  • Bi-level optimization: Weights ω are optimized to minimize the task loss; the bit-assignment parameters are optimized to minimize the validation loss plus a log-barrier penalty on total model size.
  • Empirical settings: Bit search space {0, 2, 4, 8}; embeddings and activations fixed at 8 bits; subgroup count G = 128 per matrix, balancing granularity and compute.
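A minimal sketch of the per-group quantizer and a Gumbel-softmax relaxation of the bit-width choice; this illustrates the general DNAS idea under the stated settings and is not the authors' implementation (function names and the temperature are assumptions):

```python
# Sketch: per-group quantization Q(X; s, b) = round(X * s) / s with an adaptive
# scale per group, plus a learnable categorical distribution over {0, 2, 4, 8}
# bits relaxed with Gumbel-softmax so bit assignments can be searched by gradient.
import torch
import torch.nn.functional as F

BITS = (0, 2, 4, 8)

def quantize_group(x, bits):
    """Fake-quantize one weight subgroup to `bits` bits (0 bits = pruned)."""
    if bits == 0:
        return torch.zeros_like(x)
    qmax = 2 ** (bits - 1) - 1                 # symmetric signed range
    s = qmax / x.abs().max().clamp_min(1e-8)   # adaptive scale for this group
    return torch.round(x * s) / s              # a straight-through estimator would be
                                               # used in practice; omitted for brevity

def mixed_precision_forward(weight, logits, n_groups=128, tau=1.0):
    """Soft mixture over bit-widths for each of `n_groups` subgroups of `weight`."""
    groups = weight.reshape(n_groups, -1)                   # partition the matrix
    probs = F.gumbel_softmax(logits, tau=tau, hard=False)   # (n_groups, len(BITS))
    candidates = torch.stack([torch.stack([quantize_group(g, b) for b in BITS])
                              for g in groups])             # (n_groups, bits, elems)
    mixed = (probs.unsqueeze(-1) * candidates).sum(dim=1)   # expected quantized group
    return mixed.reshape_as(weight)

# Usage: one 768x768 projection split into 128 subgroups with learnable bit logits,
# which would be optimized against the validation loss plus the size penalty.
w = torch.randn(768, 768)
bit_logits = torch.zeros(128, len(BITS), requires_grad=True)
w_q = mixed_precision_forward(w, bit_logits)
```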

Final models achieve up to a 16× reduction in disk size (DistilBERT: 162 MB → 10 MB) with negligible accuracy trade-offs on canonical benchmarks.

| Task | Float DistilBERT | Uniform 8-bit | Mixed-Distil-BERT |
|---|---|---|---|
| SST-2 | 91.3% (162 MB) | 90.5% | 89.7% (10 MB) |
| MNLI-m | 82.2% | 81.0% | 78.0% |
| SQuAD F1 | 85.8% | 83.0% | 78.3% |

5. Downstream Task Fine-tuning and Performance Evaluation

Mixed-Distil-BERT is systematically evaluated on synthetic code-mixed English–Bangla–Hindi datasets for three text classification tasks (Raihan et al., 2023):

  • Multi-label emotion detection (6 labels, 100K examples)
  • Binary sentiment analysis (100K examples)
  • Offensive language identification (100K examples)

Standard BERT fine-tuning is used (batch size 16, learning rate 2×10⁻⁵, 3–4 epochs, no further augmentation); a fine-tuning sketch follows the results below. Weighted F1 scores are competitive with larger baselines, particularly on code-mixed data:

  • Emotion detection: Mixed-Distil-BERT F1 = 0.50 vs. mBERT 0.49, XLM-R 0.51
  • Sentiment: Mixed-Distil-BERT F1 = 0.70 vs. mBERT 0.74, XLM-R 0.77
  • Offensive detection: Mixed-Distil-BERT F1 = 0.87 vs. mBERT/XLM-R 0.88
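A sketch of the fine-tuning setup with the reported hyperparameters; the public DistilBERT checkpoint is only a stand-in for the Mixed-Distil-BERT weights (whose hub id is not given in the source), and the two-example dataset is likewise illustrative:

```python
# Sketch: sequence-classification fine-tuning with batch size 16, lr 2e-5, 3 epochs.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "distilbert-base-uncased"  # stand-in checkpoint for Mixed-Distil-BERT
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

raw = Dataset.from_dict({"text": ["khub bhalo service, loved it",   # illustrative
                                  "worst khana ever, ekdom bad"],   # code-mixed text
                         "label": [1, 0]})
ds = raw.map(lambda ex: tok(ex["text"], truncation=True, max_length=512), batched=True)

args = TrainingArguments(output_dir="mdb-sentiment",
                         per_device_train_batch_size=16,
                         learning_rate=2e-5,
                         num_train_epochs=3)

Trainer(model=model, args=args, train_dataset=ds, tokenizer=tok).train()
```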

Efficiency profiles are favorable: Mixed-Distil-BERT fine-tunes 2.5× faster than XLM-R and delivers 2.5× lower inference latency per 1,000 sentences.

6. Ablation Studies and Practitioner Recommendations

Direct comparison of Tri-Distil-BERT (trilingual but not code-mixed) and Mixed-Distil-BERT isolates the impact of code-mixed pre-training: consistent gains of +1–2 pp weighted F1 on all tasks (Raihan et al., 2023). Ablations in the mixed-vocabulary regime confirm that jointly aligning embeddings via Stage 1 outperforms naïve training or forcibly tied embedding matrices (Zhao et al., 2019).

Practitioner guidelines, according to documented results:

  • For code-mixed contexts or low-resource languages, two-stage pre-training (monolingual → code-mixed) is strongly recommended.
  • For ultra-light models (<10 M parameters), use six layers, hidden size 192–256, a vocabulary of ≈5,000 tokens, and mixed-vocabulary alignment with p_SV = 0.5.
  • For resource-constrained deployment, combine mixed-precision quantization and pruning with DistilBERT for 10–20 MB models with minimal accuracy loss.

7. Context, Limitations, and Implications

Mixed-Distil-BERT variants represent distinct but related approaches to efficient language modeling. The code-mixed protocol advances multilingual representation for realistic text classification in South Asian settings, while mixed-vocabulary and mixed-precision protocols yield general-purpose students for mobile or edge applications. Reported limitations include decreased performance on very long-input tasks due to coarser vocabulary; lack of explicit alignment on transformer outputs; and dependence on synthetic code-mixed data for robust generalization. These approaches are orthogonal to other compression techniques (layer-drop, quantization, bottleneck layers) and may be stacked for further gains.

A plausible implication is that code-mixed or mixed-vocabulary models, when combined with sub-group level quantization/search, may broaden the operational domains of pretrained transformers to low-latency and low-resource regimes without substantial loss in accuracy (Raihan et al., 2023, Zhao et al., 2019, Zhao et al., 2021).
