- The paper presents a novel approach to distilling the Whisper model via large-scale pseudo-labelling, significantly reducing computational cost.
- It employs a simple WER-based heuristic to filter pseudo-labels and achieves 5.8x faster inference with 51% fewer parameters than the original model.
- Results indicate robust performance on long-form audio with minimal accuracy loss, making it well-suited for resource-constrained environments.
Overview of "Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling"
The paper introduces "Distil-Whisper," a distilled variant of the Whisper model that aims to cut the computational cost of large Automatic Speech Recognition (ASR) models while preserving their robustness. The approach applies large-scale pseudo-labelling across a broad collection of open-source audio data to perform knowledge distillation, yielding a model that is substantially smaller and faster than the original Whisper.
Methodology
The key innovation of this work is the use of pseudo-labelling to build a large open-source training set for model distillation. The Whisper model, known for its robustness across diverse acoustic conditions, serves as the teacher. Its generated transcriptions act as the training targets for the distilled model, Distil-Whisper, with a simple word error rate (WER) heuristic used to retain only high-quality pseudo-labels.
Knowledge Distillation Approach:
- Shrink and Fine-Tune: The distilled model is initialized by copying maximally spaced layers from the Whisper teacher and is then fine-tuned on the pseudo-labelled dataset (a layer-selection sketch follows this list).
- Pseudo-Labelling: The original ground-truth labels are replaced with pseudo-labels generated by the teacher model, which keeps transcription formatting consistent across the otherwise heterogeneous training corpora.
- WER Filtering: Training examples whose pseudo-labels exceed a WER threshold against the original ground-truth transcriptions are discarded, which improves the quality of the distillation data (a filtering sketch also follows this list).
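As a concrete illustration of the shrink-and-fine-tune initialization, the sketch below picks maximally spaced teacher layers to seed a shallower student. This is a minimal sketch: the helper name and the rounding scheme are assumptions made for illustration, not the authors' released implementation.

```python
import numpy as np

def select_teacher_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Pick maximally spaced teacher layer indices to initialize the student.

    The first and last teacher layers are always kept, with the remaining
    student layers spread as evenly as possible in between.
    """
    if num_student_layers == 1:
        return [num_teacher_layers - 1]
    spacing = np.linspace(0, num_teacher_layers - 1, num_student_layers)
    return [int(round(i)) for i in spacing]

# e.g. distilling a 32-layer Whisper decoder into a 2- or 4-layer student
print(select_teacher_layers(32, 2))  # [0, 31]
print(select_teacher_layers(32, 4))  # [0, 10, 21, 31]
```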
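The WER filter itself reduces to a per-example threshold check, sketched below. The third-party `jiwer` package is assumed for the WER computation, and the 10% threshold and lower-casing normalisation are illustrative choices rather than values taken from the paper.

```python
from jiwer import wer  # third-party WER implementation, assumed installed

# Illustrative threshold: keep a sample only if its pseudo-label is within
# 10% WER of the original ground-truth transcription.
WER_THRESHOLD = 0.10

def keep_sample(ground_truth: str, pseudo_label: str,
                threshold: float = WER_THRESHOLD) -> bool:
    """Return True if the teacher's pseudo-label agrees closely enough with
    the original transcription to be kept for distillation."""
    return wer(ground_truth.lower(), pseudo_label.lower()) <= threshold

examples = [
    {"text": "the cat sat on the mat", "pseudo": "the cat sat on the mat"},
    {"text": "the cat sat on the mat", "pseudo": "a hat sat on a map today"},
]
kept = [ex for ex in examples if keep_sample(ex["text"], ex["pseudo"])]
print(len(kept))  # 1: the noisy second pseudo-label is discarded
```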
Results and Performance
Performance Metrics:
- Distil-Whisper is 5.8 times faster at inference and has 51% fewer parameters than the original Whisper model.
- It stays within 1% WER of the original Whisper model on out-of-distribution test data.
Robustness and Speed:
- The distilled model can serve as a draft model for the full Whisper model in speculative decoding, yielding a two-fold speed-up while ensuring output identical to Whisper's (a sketch follows this list).
- On long-form audio, the distilled model is more robust than the teacher, showing a lower propensity for hallucination and repetition errors.
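A minimal sketch of this pairing with the Hugging Face Transformers assisted-generation API is shown below, where the distilled model drafts tokens and the full Whisper model verifies them. The checkpoint names, float16 precision, and GPU placement are assumptions about a typical setup rather than requirements stated in the paper.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device, dtype = "cuda", torch.float16  # assumed GPU setup

# Teacher (verifier) and distilled student (draft model).
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
teacher = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=dtype
).to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=dtype
).to(device)

def transcribe(audio_array, sampling_rate=16_000):
    """Speculative decoding: the student proposes tokens, the teacher accepts
    or rejects them, so the transcript matches running the teacher alone."""
    features = processor(audio_array, sampling_rate=sampling_rate,
                         return_tensors="pt").input_features.to(device, dtype)
    generated = teacher.generate(input_features=features, assistant_model=assistant)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```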
Implications and Future Directions
This work has clear implications for future AI development, particularly in resource-constrained environments where deploying large ASR models is impractical. The success of Distil-Whisper shows that large-scale pseudo-labelling combined with simple heuristic filtering can preserve model robustness in a far smaller, faster model.
The open availability of the training and inference code encourages further exploration and adaptation in model distillation and ASR. Future research could extend beyond pseudo-labelling, for example by incorporating cross-modal knowledge distillation or multi-modal datasets to refine ASR models further.
Conclusion
The paper presents a well-justified approach to reducing model size and increasing inference speed without sacrificing accuracy or robustness. Distil-Whisper is a significant step toward efficient, practical ASR systems for diverse and demanding deployment environments, and the strategies it sets out provide a strong foundation for future work on model distillation.