Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling (2311.00430v1)

Published 1 Nov 2023 in cs.CL, cs.SD, and eess.AS

Abstract: As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.

Authors (3)
  1. Sanchit Gandhi (6 papers)
  2. Patrick von Platen (15 papers)
  3. Alexander M. Rush (115 papers)
Citations (37)

Summary

  • The paper presents a novel approach to distill the Whisper model via large-scale pseudo labelling, significantly reducing computational cost.
  • It employs a WER-based filtering method to select high-quality labels and achieves 5.8x faster inference with 51% fewer parameters.
  • Results indicate robust performance on long-form audio with minimal accuracy loss, making it well-suited for resource-constrained environments.

Overview of "Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling"

The paper introduces "Distil-Whisper," a distilled variant of the Whisper model, aimed at reducing the computational burden of large Automatic Speech Recognition (ASR) models while maintaining robust performance. The approach leverages pseudo-labelling alongside a comprehensive dataset to perform knowledge distillation effectively, resulting in a model that is significantly smaller and faster than its predecessor.

Methodology

The key innovation in this work is the use of pseudo-labelling to assemble a vast open-source dataset for model distillation. The Whisper model, known for its robustness across diverse acoustic conditions, serves as the teacher: it transcribes the audio to produce pseudo-labels, and only the highest-quality labels, selected with a simple word error rate (WER) heuristic, are used to train the distilled student model, Distil-Whisper.
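As a rough illustration of the pseudo-labelling step, the sketch below transcribes a dataset with the teacher model and stores the output as the training target. It assumes the Hugging Face transformers pipeline API; the dataset handling and column names are purely illustrative, not the authors' released code.

```python
# Minimal sketch of the pseudo-labelling step, assuming the Hugging Face
# `transformers` pipeline API; column names such as "pseudo_label" are illustrative.
from transformers import pipeline

teacher = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",  # teacher checkpoint (assumed available)
    device=0,                         # set to -1 to run on CPU
)

def add_pseudo_label(example):
    # Transcribe the raw audio with the teacher; the resulting text becomes the
    # training target for the student in place of the original transcript.
    prediction = teacher(
        {"array": example["audio"]["array"],
         "sampling_rate": example["audio"]["sampling_rate"]}
    )
    example["pseudo_label"] = prediction["text"]
    return example

# e.g. dataset = dataset.map(add_pseudo_label)   # `dataset` is a datasets.Dataset
```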

Knowledge Distillation Approach:

  • Shrink and Fine-Tune: The student is initialized by copying maximally spaced layers from the Whisper decoder (the encoder is retained in full), followed by fine-tuning on the new pseudo-labelled dataset.
  • Pseudo-Labelling: The original ground-truth transcriptions are replaced with pseudo-labels generated by the teacher model, which keeps transcription formatting consistent across the many source datasets.
  • WER Filtering: Training examples whose pseudo-label exceeds a WER threshold against the ground-truth transcription are discarded, improving the reliability of the distillation targets (see the sketch below).
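The following sketch, in plain Python, illustrates the two heuristics above: picking maximally spaced layer indices for the student initialization and discarding pseudo-labelled examples above a WER threshold. Function and field names (e.g. `pseudo_label`, `wer_threshold`) are illustrative, not the authors' released code.

```python
# Sketch of (1) maximally spaced layer selection and (2) WER-based filtering.

def spaced_layer_indices(n_teacher: int, n_student: int) -> list[int]:
    """Pick n_student maximally spaced layer indices out of n_teacher (n_student >= 2)."""
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def filter_pseudo_labels(examples, wer_threshold=0.10):
    """Keep only examples whose pseudo-label stays below the WER threshold."""
    return [ex for ex in examples
            if word_error_rate(ex["ground_truth"], ex["pseudo_label"]) < wer_threshold]

# e.g. a 2-layer student drawn from a 32-layer decoder copies the first and last layers:
print(spaced_layer_indices(32, 2))  # -> [0, 31]
```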

Results and Performance

Performance Metrics:

  • Distil-Whisper achieves a significant reduction in computational cost: it runs 5.8 times faster than Whisper with 51% fewer parameters.
  • The model stays within 1% WER of the original Whisper model on out-of-distribution test data in a zero-shot transfer setting.

Robustness and Speed:

  • The distilled model can be paired with the original Whisper model for speculative decoding, yielding a twofold speed-up while guaranteeing output identical to Whisper alone (see the sketch below).
  • On long-form audio it is less prone to hallucination errors than the original model, while matching its robustness to difficult acoustic conditions.
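As a rough sketch of how this pairing can be set up, the snippet below uses the assisted-generation (`assistant_model`) path of the Hugging Face transformers library, with Distil-Whisper drafting tokens that the full Whisper model verifies. The checkpoint names are the public Hugging Face identifiers, and the undefined `audio_array` input is a placeholder; this is an assumed usage pattern, not the paper's reference implementation.

```python
# Minimal sketch of speculative decoding with Distil-Whisper as the draft
# ("assistant") model, assuming the Hugging Face transformers API.
# `audio_array` is a placeholder for a 16 kHz mono waveform (NumPy array).
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
teacher = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt").to(device)

# The assistant drafts several tokens ahead; the teacher verifies them, so the
# final transcription matches what the teacher alone would have produced.
generated_ids = teacher.generate(**inputs, assistant_model=assistant)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```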

Implications and Future Directions

This work has several implications for future AI developments, particularly in resource-constrained environments where deploying large ASR models is impractical. The success of Distil-Whisper demonstrates the potential for large-scale pseudo-labelling and simple heuristic filtering to enhance model robustness without extensive parameter inflation.

The open availability of the training and inference code fosters further exploration and adaptation in the field of model distillation and ASR systems. Future research could explore integrative methodologies that extend beyond pseudo-labelling, such as incorporating cross-modal knowledge distillation or leveraging multi-modal datasets to refine ASR models further.

Conclusion

The paper presents a well-justified approach to reducing model size and increasing inference speed without sacrificing accuracy and robustness. Distil-Whisper represents a significant step forward in developing efficient, practical ASR systems suitable for diverse and demanding environments. The strategies and methodologies set forth provide a strong foundation for future innovations in model distillation.
