Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data (2309.13876v3)

Published 25 Sep 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.

Citations (26)

Summary

  • The paper introduces OWSM, an open-source framework that replicates Whisper’s multilingual, multitask training using publicly available data.
  • The paper extends Whisper’s capabilities by enabling any-to-any speech translation beyond the original any-to-English limitation.
  • The paper demonstrates efficient training techniques that yield competitive performance on benchmarks despite using significantly less training data.

An Analytical Overview of "Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data"

This paper presents the development of an Open Whisper-style Speech Model (OWSM), a reproduction of OpenAI's Whisper built with an open-source toolkit and publicly available data. OpenAI's Whisper is renowned for its proficiency in multilingual multitask speech processing, including automatic speech recognition (ASR) and speech translation (ST). Notably, Whisper's development pipeline remains proprietary, which makes it difficult for the broader research community to reproduce and extend its capabilities. This paper addresses these constraints by releasing the OWSM framework, encouraging advancement and transparency in large-scale speech model pre-training.
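Whisper-style multitask conditioning is typically realized through special decoder tokens that select the language and task. The sketch below is illustrative only: the token names (`<sos>`, `<eng>`, `<asr>`, `<st_deu>`, `<notimestamps>`) and the helper `build_prompt` are hypothetical and do not reflect the exact vocabulary used by Whisper or OWSM.

```python
from typing import Optional

def build_prompt(src_lang: str, task: str, tgt_lang: Optional[str] = None,
                 timestamps: bool = False) -> list:
    """Assemble a hypothetical condition-token prefix for the decoder."""
    tokens = ["<sos>", "<%s>" % src_lang]
    if task == "asr":
        tokens.append("<asr>")
    elif task == "st":
        # Encoding the target language in the task token is one way to
        # support any-to-any translation rather than any-to-English only.
        tokens.append("<st_%s>" % tgt_lang)
    tokens.append("<timestamps>" if timestamps else "<notimestamps>")
    return tokens

print(build_prompt("eng", "asr"))
print(build_prompt("fra", "st", tgt_lang="deu"))
```

Keeping the task specification in the token stream lets a single model share its encoder and decoder across ASR, translation, and language identification without task-specific heads.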

Key Contributions

  1. Open Reproduction of Whisper-Style Training: The authors introduced OWSM, an open-source alternative to Whisper, achieved by mirroring its architecture and training methodologies. By employing public datasets and open-source tools, OWSM serves as a framework for replicating Whisper's multilingual, multitask functionality.
  2. Support for Additional Translation Directions: OWSM not only mirrors Whisper’s core functionalities but extends them by supporting any-to-any speech translation, as opposed to Whisper’s any-to-English limitation.
  3. Efficient Training Techniques: The methodology includes leveraging strategies for improved training efficiency and convergence, such as warm initialization and joint CTC/attention mechanisms, which enhance the stability and performance of the training process.
  4. Open Science Commitment: By releasing all scripts, models, and logs associated with OWSM's training process, the authors bolster the ethos of open science, enabling broader community access to resources that facilitate further research and innovation.
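The joint CTC/attention training mentioned in contribution 3 is commonly formulated as an interpolation of the two losses. The weight shown is the conventional one from hybrid CTC/attention recipes, not a value quoted from the paper:

```latex
\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1
```

The auxiliary CTC branch enforces monotonic alignment between speech and text, which typically stabilizes convergence of the attention decoder, consistent with the stability improvements the authors report.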

Experimental Findings

The authors extensively evaluated OWSM against Whisper and reported noteworthy findings. While OWSM models were trained on significantly less data than Whisper (180k vs. 680k hours), they still demonstrated competitive performance across various benchmarks.

  • English ASR: OWSM shows competitive results on benchmarks like LibriSpeech, even surpassing Whisper in certain contexts, despite having access to a smaller dataset.
  • Multilingual ASR: Although Whisper generally outperforms OWSM due to its larger training corpus, OWSM shows similar or superior results in languages where it has more data (e.g., Japanese).
  • Speech Translation and Language Identification: OWSM supports multilingual translation and has demonstrated successful language identification accuracy, benefiting from access to a broad spectrum of language data.
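In Whisper-style models, language identification can be performed in a single decoder step: the logits over the language tokens are normalized and the argmax is taken as the prediction. The sketch below assumes this mechanism; the logit values are made up for illustration.

```python
import math

def identify_language(lang_logits: dict) -> tuple:
    """Softmax over language-token logits; return (language, probability)."""
    m = max(lang_logits.values())  # subtract max for numerical stability
    exps = {lang: math.exp(z - m) for lang, z in lang_logits.items()}
    total = sum(exps.values())
    lang = max(exps, key=exps.get)
    return lang, exps[lang] / total

# Hypothetical logits from the first decoding step after <sos>:
lang, prob = identify_language({"<eng>": 2.1, "<jpn>": 4.0, "<fra>": 0.5})
print(lang, round(prob, 3))
```

Because language identification reuses the same token inventory as ASR and ST, broad language coverage in the training data directly benefits all three tasks.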

Implications and Future Directions

This work significantly contributes to the democratization of speech model research. It opens new vistas for exploring advanced speech processing techniques, including robust multitask learning frameworks and efficient model architectures. Practically, the availability of such a comprehensive toolkit will likely spur developments in real-world speech applications, making state-of-the-art technologies accessible beyond industrial confines.

Future research may focus on scaling the model akin to Whisper by collecting more diverse datasets or employing advanced architectures for computational efficiency. Exploration into additional speech processing tasks and model compression techniques for deployability will further align OWSM with practical use cases. The model also offers a testbed for investigating broader topics, such as continual learning, data imbalance, and model robustness.

Conclusion

OWSM demonstrates the potential of open-source efforts to replicate and extend proprietary technology, underscoring a commitment to open science and collaborative improvement. The project not only mirrors the functionality of established models like Whisper but also provides a foundation the research community can build upon, ensuring that innovation in speech processing remains an inclusive endeavor.
