Evolving Knowledge Distillation for Lightweight Neural Machine Translation

Published 11 May 2026 in cs.CL | (2605.09924v1)

Abstract: Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on resource-limited devices. Knowledge distillation (KD) is a promising approach for compressing models, but its effectiveness diminishes when there is a large capacity gap between teacher and student models. To address this issue, we propose Evolving Knowledge Distillation (EKD), a progressive training framework in which the student model learns from a sequence of teachers with gradually increasing capacities. Experiments on IWSLT-14, WMT-17, and WMT-23 benchmarks show that EKD leads to consistent improvements at each stage. On IWSLT-14, the final student achieves a BLEU score of 34.24, narrowing the gap to the strongest teacher (34.32 BLEU) to just 0.08 BLEU. Similar trends are observed on other datasets. These results demonstrate that EKD effectively bridges the capacity gap, enabling compact models to achieve performance close to that of much larger teacher models. Code and models are available at https://github.com/agi-content-generation/EKD.


Authors (3)

Summary

The paper introduces an iterative distillation framework that uses a hierarchy of incrementally larger teachers to effectively mitigate capacity gaps and enhance translation performance.
It employs a two-stage process combining junior teacher distillation with KL divergence minimization and senior teacher refinement, integrating curriculum learning for smoother knowledge transfer.
Experimental results on IWSLT and WMT benchmarks show that EKD narrows the relative BLEU gap to the senior teacher to 0.23% on IWSLT-14 and to roughly 2–4% on the WMT benchmarks, outperforming traditional KD and TAKD while maintaining computational efficiency.

Evolving Knowledge Distillation for Compact Neural Machine Translation

Background and Motivation

Rapid advancements in transformer-based Neural Machine Translation (NMT) have produced models with very large parameter counts, notably M2M-100 and NLLB, which deliver strong translation accuracy across many languages. This growth in scale, however, constrains deployment on devices with limited compute and memory. Knowledge Distillation (KD), in which a "student" model is trained to mimic a larger "teacher", is widely adopted for such compression. The efficacy of traditional KD nevertheless diminishes markedly as the parameter gap between teacher and student grows: the student struggles to absorb the teacher's complex knowledge, yielding suboptimal translation performance.

Teacher Assistant Knowledge Distillation (TAKD), which inserts intermediate teacher models between teacher and student, mitigates this but suffers notable performance degradation at each stage, limiting how closely the student can approach the teacher's performance. To address this core challenge, the paper introduces Evolving Knowledge Distillation (EKD): an iterative distillation framework in which a student learns from a hierarchy of teachers with steadily increasing capacities.

Methodology

EKD operates under a progressive KD paradigm, facilitating smoother knowledge transfer and diminishing the capacity disparity. The framework comprises:

Teacher Model Hierarchy: Models are ordered by ascending parameter count: student, junior teacher, senior teacher, and optionally larger teachers still (e.g., a master teacher). All models share the same transformer-based architecture and differ only in scale.
Two-Stage Distillation:
- Stage 1 (Junior Teacher Distillation): The student is distilled from the junior teacher, minimizing the Kullback-Leibler (KL) divergence between output distributions, combined with the task-specific cross-entropy loss.
- Stage 2 (Senior Teacher Refinement): The evolved student undergoes further distillation, now from the senior teacher, again balancing KL divergence and task loss with adjustable hyperparameters.
Curriculum Learning Perspective: EKD can be interpreted as embedding curriculum learning into KD—starting with an easier (junior teacher) target and gradually transitioning to more challenging (senior teacher) objectives as the student’s capacity and knowledge increase.
Extension to Multiple Teachers: The approach is inherently extensible; more hierarchical teacher stages facilitate even finer-grained progressive knowledge transfer, further refining the student’s capabilities.
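The per-stage objective and the teacher sequence described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation. The weighting form `alpha * KL + (1 - alpha) * CE`, the direction KL(teacher || student), the `temperature` parameter, and the names `distillation_loss`, `evolve`, and `train_stage` are all assumptions made for the example; the source only states that KL divergence and the task loss are balanced with adjustable hyperparameters.

```python
import math
from typing import Callable, List, Sequence, TypeVar


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the reference token."""
    return -math.log(probs[target])


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    target: int,
    alpha: float = 0.5,        # assumed trade-off weight (hypothetical default)
    temperature: float = 1.0,  # assumed softening temperature
) -> float:
    """Assumed form: alpha * KL(teacher || student) + (1 - alpha) * CE."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kd_term = kl_divergence(teacher_probs, student_soft)
    ce_term = cross_entropy(softmax(student_logits), target)
    return alpha * kd_term + (1.0 - alpha) * ce_term


S = TypeVar("S")
T = TypeVar("T")


def evolve(student: S, teachers: Sequence[T], train_stage: Callable[[S, T], S]) -> S:
    """Run one distillation stage per teacher, smallest teacher first.

    `train_stage` is a hypothetical callback that trains the student
    against one teacher and returns the updated student.
    """
    for teacher in teachers:
        student = train_stage(student, teacher)
    return student


# Loss for a single token position with a three-word toy vocabulary.
loss = distillation_loss([2.0, 0.5, -1.0], [2.5, 0.3, -1.2], target=0)
```

Passing `[junior, senior]` to `evolve` gives the two-stage procedure; appending a master teacher extends it to three stages without changing the per-stage loss.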

Experimental Setup

Experiments are conducted on IWSLT-14 (German-English), WMT-23 (English-Czech), and WMT-17 (English-German). All models use transformers, with parameter counts for student (6M–14M), junior teacher (15M–31M), and senior teacher (39M–72M) chosen to be within a factor of about 2–3 between adjacent stages. Models are trained using standard NMT practices: Moses tokenization, BPE subword segmentation, label-smoothed cross-entropy loss, Adam optimizer, and beam search decoding. Evaluation metrics include detokenized BLEU and neural COMET scores.
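As a quick sanity check on the "factor of about 2–3" statement, the snippet below computes size ratios between adjacent stages from the parameter ranges quoted above. Pairing lower bound with lower bound and upper bound with upper bound is an assumption made for illustration; the source does not say which student size goes with which teacher size.

```python
from typing import Dict, List, Tuple

# Parameter-count ranges in millions, as quoted in the setup description.
PARAM_RANGES_M: Dict[str, Tuple[float, float]] = {
    "student": (6, 14),
    "junior_teacher": (15, 31),
    "senior_teacher": (39, 72),
}
STAGE_ORDER: List[str] = ["student", "junior_teacher", "senior_teacher"]


def adjacent_ratios(
    ranges: Dict[str, Tuple[float, float]], order: List[str]
) -> List[Tuple[str, str, float, float]]:
    """Size ratio of each model to the next one up, at both range ends."""
    rows = []
    for smaller, larger in zip(order, order[1:]):
        low_small, high_small = ranges[smaller]
        low_large, high_large = ranges[larger]
        rows.append((smaller, larger, low_large / low_small, high_large / high_small))
    return rows


for smaller, larger, ratio_low, ratio_high in adjacent_ratios(PARAM_RANGES_M, STAGE_ORDER):
    print(f"{smaller} -> {larger}: {ratio_low:.2f}x (lower bounds), {ratio_high:.2f}x (upper bounds)")
```

Under this pairing, all four ratios fall between 2 and 3, consistent with the stated design.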

Results

EKD consistently delivers substantial translation performance improvements across all datasets:

IWSLT-14 De-En: The student distilled via EKD (T_senior → [T_junior → S]) achieves a BLEU score of 34.24, only 0.08 lower than the senior teacher (34.32), a relative gap of 0.23%. By comparison, traditional KD, which distills the student directly from the senior teacher, leaves a gap of 3.23 BLEU (9.41%).
WMT-23 En-Cs & WMT-17 En-De: Similar trends are observed; EKD narrows the gap between student and teacher from ~11–17% to ~2–4%.
COMET Metric: EKD outperforms both traditional KD and TAKD in terms of semantic fidelity, with improvements in the range of 5.5–9.1% across stages.
TAKD Comparison: EKD surpasses TAKD by 5.8% BLEU, with the student even exceeding the performance of intermediate teacher assistants—a phenomenon rarely achieved in TAKD.
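The relative gaps quoted for IWSLT-14 follow from dividing the absolute BLEU gap by the senior teacher's score. The short check below reproduces that arithmetic. The function name is illustrative, and the direct-KD student score is not quoted in the text; it is derived here from the reported 3.23 BLEU gap.

```python
def relative_gap_percent(teacher_bleu: float, student_bleu: float) -> float:
    """Gap between teacher and student as a percentage of the teacher's BLEU."""
    return 100.0 * (teacher_bleu - student_bleu) / teacher_bleu


SENIOR_TEACHER_BLEU = 34.32
EKD_STUDENT_BLEU = 34.24
# Derived from the reported 3.23 BLEU gap, not quoted directly in the text.
DIRECT_KD_STUDENT_BLEU = SENIOR_TEACHER_BLEU - 3.23

ekd_gap = relative_gap_percent(SENIOR_TEACHER_BLEU, EKD_STUDENT_BLEU)
direct_kd_gap = relative_gap_percent(SENIOR_TEACHER_BLEU, DIRECT_KD_STUDENT_BLEU)

print(f"EKD gap: {ekd_gap:.2f}%")              # 0.23%
print(f"Direct KD gap: {direct_kd_gap:.2f}%")  # 9.41%
```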

Further experiments demonstrate the scalability of EKD with additional teacher levels: introducing a "master" teacher leads to incremental improvements, although gains plateau as teacher models approach their performance ceiling. The student consistently benefits from exposure to more capable teachers, irrespective of its fixed parameter count.

EKD's computational efficiency is noteworthy. Training costs do not scale dramatically; in fact, EKD incurs lower FLOPs than training the junior teacher from scratch, facilitating practical deployment.

Implications and Future Directions

The results indicate that hierarchical, evolving distillation lets compact NMT models come close to their larger teachers, easing the limits that capacity gaps impose on conventional KD. In the reported experiments, EKD's curriculum-based, progressive structure improves knowledge retention and transfer over both conventional KD and TAKD. This matters for deployment on edge devices and in latency- or bandwidth-constrained settings.

Conceptually, the results are consistent with the view that knowledge transfer becomes less effective as the teacher-student capacity gap widens. EKD's staged approach matches teacher complexity to what the student can absorb at each step, smoothing the transfer pathway.

Future research directions include:

Extension to Heterogeneous Architectures: The current validation is limited to transformer-based models; cross-family distillation (e.g., transformer teacher, Mamba student) presents unexplored challenges.
Multi-domain and Continual Learning: EKD could be adapted for domain-adaptive NMT, lifelong learning, and multi-task settings.
Hyperparameter Optimization: Systematic search over distillation trade-off weights and teacher hierarchy structuring may yield further improvements.

Conclusion

Evolving Knowledge Distillation represents a substantial advance in NMT model compression, delivering lightweight models with performance nearly indistinguishable from larger teachers. By progressively bridging capacity gaps through hierarchical teacher sequences, EKD enables efficient translation quality retention and improved deployability. The framework is flexible, computationally efficient, and empirically validated across diverse language pairs and datasets, offering a practical blueprint for the next generation of compact NMT systems (2605.09924).
