- The paper introduces OWSM v3.1, an open speech foundation model that adopts the E-Branchformer encoder to improve accuracy while delivering up to 25% faster inference.
- The paper combines methodical data preparation with a piecewise linear learning rate schedule, achieving stable convergence without any additional training data.
- The paper reports lower word error rates and higher BLEU scores than OWSM v3, and outperforms Whisper on several ASR and translation benchmarks.
Introduction to OWSM v3.1
The progressive refinement of speech processing models has driven state-of-the-art results across a wide range of speech tasks. The move from earlier Open Whisper-style Speech Model (OWSM) versions to OWSM v3.1 delivers substantial gains in both performance and efficiency without any additional training data. Built on the E-Branchformer encoder, OWSM v3.1 is released at two scales, 100M and 1B parameters, making the 1B model the largest E-Branchformer-based speech model released to date.
Setting it apart from its predecessors, the new version outperforms OWSM v3 on numerous benchmarks and surpasses the widely used Whisper on multiple datasets. Notably, OWSM v3.1's inference is up to 25% faster than its predecessor's, reflecting the efficiency of the new design.
OWSM v3.1 Enhancements
The architectural switch to E-Branchformer strengthens speech modeling by capturing and merging local and global contextual information from speech sequences in parallel branches. Adjustments to the network configuration, such as hidden layer sizes and the number of layers, yield a model that is slightly larger yet considerably faster than its OWSM v3 and Whisper counterparts.
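The parallel-branch idea can be sketched in PyTorch. The layer below is a simplified illustration, not the exact ESPnet implementation: a global self-attention branch and a local convolutional-gating (cgMLP-style) branch run side by side, and their concatenated outputs are merged with a depthwise convolution. Macaron-style feed-forward modules and other details of the real E-Branchformer are omitted for brevity, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EBranchformerLayerSketch(nn.Module):
    """Simplified sketch of one E-Branchformer encoder layer:
    a global (self-attention) branch and a local (conv-gated) branch
    computed in parallel, then merged with a depthwise convolution."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=31):
        super().__init__()
        # Global branch: multi-head self-attention.
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Local branch: pointwise expansion, depthwise conv gate, projection.
        self.local_norm = nn.LayerNorm(d_model)
        self.local_in = nn.Linear(d_model, 2 * d_model)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.local_out = nn.Linear(d_model, d_model)
        # Merge module: depthwise conv over concatenated branches, project back.
        self.merge_conv = nn.Conv1d(2 * d_model, 2 * d_model, kernel_size,
                                    padding=kernel_size // 2, groups=2 * d_model)
        self.merge_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x):  # x: (batch, time, d_model)
        # Global branch.
        y = self.attn_norm(x)
        g, _ = self.attn(y, y, y, need_weights=False)
        # Local branch: split into content and gate, convolve the gate.
        h = self.local_in(self.local_norm(x))
        a, b = h.chunk(2, dim=-1)
        b = self.depthwise(b.transpose(1, 2)).transpose(1, 2)
        l = self.local_out(a * b)  # gated local features
        # Merge both branches with a depthwise conv, then a residual add.
        cat = torch.cat([g, l], dim=-1)
        merged = cat + self.merge_conv(cat.transpose(1, 2)).transpose(1, 2)
        return x + self.merge_proj(merged)
```

The key design choice, relative to a plain Transformer layer, is that local (convolutional) and global (attention) context are modeled concurrently rather than stacked sequentially, and the merge module learns how to combine them.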
Beyond architecture, OWSM v3.1 also benefits from methodical data preparation and a piecewise linear learning rate schedule. The new schedule stabilizes convergence during training, a challenge overcome without adding any training data.
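A piecewise linear schedule can be expressed as linear interpolation between a handful of (step, learning-rate) breakpoints. The helper below is a minimal sketch of that idea; the breakpoint values in the example are hypothetical, not the paper's actual hyperparameters.

```python
def piecewise_linear_lr(step, breakpoints):
    """Piecewise linear learning-rate schedule: interpolate linearly
    between consecutive (step, lr) breakpoints and hold the final
    value once the last breakpoint is passed."""
    for (s0, lr0), (s1, lr1) in zip(breakpoints, breakpoints[1:]):
        if step <= s1:
            t = (step - s0) / (s1 - s0)  # position within this segment
            return lr0 + t * (lr1 - lr0)
    return breakpoints[-1][1]  # past the last breakpoint: hold

# Hypothetical schedule: warm up to an intermediate rate, continue
# to the peak rate, then decay linearly (all values illustrative).
schedule = [(0, 0.0), (30_000, 5e-5), (60_000, 2e-4), (600_000, 1e-5)]
```

Splitting the warmup into stages like this lets early training proceed at a gentler rate before ramping to the peak, which is the stability property the paper attributes to its schedule.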
Experimental Results
A series of rigorous benchmarks confirms OWSM v3.1's performance gains. On English Automatic Speech Recognition (ASR), OWSM v3.1 outperforms OWSM v3 on 8 of 9 test sets and achieves a lower Word Error Rate (WER) than Whisper on several of them, despite Whisper's much larger 438K-hour training set. Multilingual ASR benchmarks likewise show broad improvements, with notable reductions in error rates for Chinese and Japanese.
On speech translation, OWSM v3.1 achieves higher BLEU scores than OWSM v3 across various test sets while also decoding faster, which improves its practicality in real-world scenarios. Long-form ASR and language identification also improve considerably over OWSM v3.
Forward-Looking Perspectives
This research demonstrates how architectural innovation can raise the capabilities of speech processing models. OWSM v3.1 lays a foundation for future work, including training a model on permissively licensed data, expanding the datasets for broader language support, and developing more efficient speech encoder architectures. Researchers are also encouraged to apply OWSM v3.1 to downstream tasks and continual learning frameworks.
Conclusion
Overall, OWSM v3.1 marks a step forward for open-source, high-performing, and efficient speech foundation models. By publicly releasing the model weights and training logs, the work fosters transparency and gives the broader speech processing community a basis to build on, advancing the open science initiative.