OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer (2401.16658v3)

Published 30 Jan 2024 in cs.CL and eess.AS

Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1 outperforms its predecessor, OWSM v3, in most evaluation benchmarks, while showing an improved inference speed of up to 25%. We further reveal the emergent ability of OWSM v3.1 in zero-shot contextual biasing speech recognition. We also provide a model trained on a subset of data with low license restrictions. We will publicly release the code, pre-trained models, and training logs.

Citations (30)

Summary

  • The paper introduces OWSM v3.1, a speech model that employs the E-Branchformer to boost accuracy and achieve up to 25% faster inference.
  • The paper applies methodical data preparation and a piecewise linear learning rate schedule to ensure stable convergence across diverse ASR and translation benchmarks.
  • The paper reports superior performance with lower word error rates and higher BLEU scores compared to previous models and Whisper.

Introduction to OWSM v3.1

Progressive improvements to speech processing models have been central to achieving state-of-the-art results across a variety of speech tasks. The move from previous Open Whisper-style Speech Model (OWSM) versions to OWSM v3.1 is a significant step, delivering substantial gains in both performance and efficiency without any additional training data. Employing the E-Branchformer as its encoder, OWSM v3.1 comes in two scales, 100M and 1B parameters, with the 1B model being the largest E-Branchformer-based speech model released to date.

A key differentiator from its predecessors is benchmark performance: the new version surpasses OWSM v3 on numerous benchmarks and even outperforms the widely used Whisper on multiple datasets. Notably, OWSM v3.1's inference is up to 25% faster than its predecessor's, reflecting the model's improved efficiency.

OWSM v3.1 Enhancements

The architectural switch to E-Branchformer improves speech modeling by capturing and integrating both local and global contextual information from speech sequences through two parallel branches. Adjustments to the network configuration, such as hidden layer sizes and the number of layers, yield a model that is slightly larger yet much faster than its OWSM v3 and Whisper counterparts; a sketch of the layer design follows.
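
To make the two-branch design concrete, below is a minimal PyTorch sketch of a single E-Branchformer-style encoder layer: a self-attention branch for global context, a convolutional gating MLP (cgMLP) branch for local context, and a depthwise-convolution merge module between two macaron-style feed-forward blocks. This is an illustrative approximation, not the ESPnet implementation; the dimensions, kernel sizes, and activation choices are placeholders.

```python
import torch
import torch.nn as nn

class EBranchformerLayerSketch(nn.Module):
    """Illustrative single E-Branchformer-style encoder layer (not the ESPnet code)."""

    def __init__(self, d_model=512, n_heads=8, cgmlp_units=2048, kernel=31):
        super().__init__()
        # macaron-style feed-forward blocks, each applied with 0.5 weight
        self.ffn1 = nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, 4 * d_model),
            nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.ffn2 = nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, 4 * d_model),
            nn.GELU(), nn.Linear(4 * d_model, d_model))
        # global branch: multi-head self-attention
        self.norm_att = nn.LayerNorm(d_model)
        self.att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # local branch: convolutional gating MLP (cgMLP)
        self.norm_cgmlp = nn.LayerNorm(d_model)
        self.cgmlp_in = nn.Linear(d_model, cgmlp_units)
        self.gate_conv = nn.Conv1d(cgmlp_units // 2, cgmlp_units // 2,
                                   kernel, padding=kernel // 2,
                                   groups=cgmlp_units // 2)
        self.cgmlp_out = nn.Linear(cgmlp_units // 2, d_model)
        # merge module: depthwise conv over the concatenated branches
        self.merge_conv = nn.Conv1d(2 * d_model, 2 * d_model, kernel,
                                    padding=kernel // 2, groups=2 * d_model)
        self.merge_proj = nn.Linear(2 * d_model, d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)
        # global context via self-attention
        g = self.norm_att(x)
        g, _ = self.att(g, g, g, need_weights=False)
        # local context via cgMLP: project, split, gate with a depthwise conv
        c = torch.nn.functional.gelu(self.cgmlp_in(self.norm_cgmlp(x)))
        a, b = c.chunk(2, dim=-1)
        b = self.gate_conv(b.transpose(1, 2)).transpose(1, 2)
        c = self.cgmlp_out(a * b)
        # merge the two branches: concat, depthwise conv with residual, project
        m = torch.cat([g, c], dim=-1)
        m = m + self.merge_conv(m.transpose(1, 2)).transpose(1, 2)
        x = x + self.merge_proj(m)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)
```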

Beyond the architecture, OWSM v3.1 also benefits from methodical data preparation and a piecewise linear learning rate schedule. The new schedule stabilizes convergence during training, a challenge addressed without increasing the amount of training data; a sketch of the idea is given below.
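
As a rough illustration, the sketch below warms the learning rate up in two linear segments, a slow initial ramp followed by a steeper climb to the peak, before a standard inverse-square-root decay. The breakpoints and rates here are placeholders, not the values used in the paper; the released training logs are the authoritative reference.

```python
def piecewise_linear_lr(step, mid_step=30_000, mid_lr=1e-5,
                        warmup_steps=60_000, peak_lr=2.5e-4):
    """Two-segment linear warmup, then inverse-square-root decay.

    All breakpoints and rates are illustrative placeholders; the exact
    OWSM v3.1 values should be taken from the released training configs.
    """
    if step <= mid_step:
        # segment 1: slow ramp up to a small intermediate learning rate
        return mid_lr * step / mid_step
    if step <= warmup_steps:
        # segment 2: steeper ramp from mid_lr up to the peak
        frac = (step - mid_step) / (warmup_steps - mid_step)
        return mid_lr + frac * (peak_lr - mid_lr)
    # after warmup: standard inverse-square-root decay
    return peak_lr * (warmup_steps / step) ** 0.5
```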

Experimental Results

A series of rigorous benchmarks demonstrates OWSM v3.1's performance gains. On English automatic speech recognition (ASR) benchmarks, OWSM v3.1 outperforms OWSM v3 on 8 of 9 test sets and achieves a lower word error rate (WER) than Whisper, which was trained on 438K hours of English data. Multilingual ASR benchmarks also show comprehensive improvements, with notable reductions in Chinese and Japanese error rates.

On speech translation, OWSM v3.1 achieves higher BLEU scores than OWSM v3 across various test sets, and its faster decoding improves its applicability in real-world scenarios. Long-form ASR and language identification likewise show considerable gains over OWSM v3.

Forward-Looking Perspectives

This research demonstrates the role of architectural innovation in advancing speech processing models. OWSM v3.1 lays a foundation for future work, including training a model on freely licensed data, expanding the datasets for broader language coverage, and developing more efficient speech encoder architectures. Researchers are also encouraged to apply OWSM v3.1 to downstream tasks and continual learning frameworks; a usage sketch follows.
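
For readers who want to try the released checkpoints, the following is a hypothetical usage sketch based on ESPnet's speech-to-text inference interface. The model identifier, special-token symbols, and decoding options are assumptions drawn from ESPnet conventions rather than from this paper; the released code and logs are the authoritative reference.

```python
# Hypothetical usage sketch; identifiers and options are assumptions.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

model = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",  # assumed Hugging Face model id
    lang_sym="<eng>",        # assumed language token for English
    task_sym="<asr>",        # assumed task token (ASR vs. translation)
    beam_size=5,
)

speech, rate = sf.read("utterance.wav")  # 16 kHz mono audio assumed
results = model(speech)
print(results[0][0])  # best hypothesis text, per ESPnet's convention
```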

Conclusion

Overall, OWSM v3.1 marks a clear step forward for open-source, high-performing, and efficient speech foundation models. By publicly releasing the model weights and training logs, the work fosters transparency and lets the broader speech processing community build on these improvements, propelling open science forward.