Overview of "fairseq S2T: Fast Speech-to-Text Modeling with fairseq"
The paper presents fairseq S2T, an extension of the fairseq framework tailored to speech-to-text (S2T) modeling tasks, namely automatic speech recognition (ASR) and speech-to-text translation (ST). The emphasis is on scalability and extensibility, building on fairseq's established infrastructure for machine translation and language modeling.
Key Features and Models
fairseq S2T integrates a variety of state-of-the-art model architectures, including RNN-based, Transformer-based, and Conformer-based models, and covers the complete workflow from data pre-processing through model training to both offline and online inference. A notable aspect is the seamless integration of fairseq's machine translation (MT) models and language models (LMs) for multi-task learning and transfer learning; a sketch of the dominant architecture family follows below.
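To make the Transformer-based architecture concrete, here is a minimal PyTorch sketch of the general shape of such a model: a strided convolutional subsampler that shrinks the acoustic frame sequence before a standard Transformer encoder stack. All dimensions, kernel sizes, and layer counts here are illustrative assumptions, not fairseq S2T's actual defaults.

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Two strided 1-D convolutions that shrink the time axis 4x,
    mapping filterbank frames to the Transformer's model dimension."""
    def __init__(self, in_dim=80, mid_dim=512, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, mid_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(mid_dim, out_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )

    def forward(self, feats):                 # feats: (batch, time, in_dim)
        x = self.conv(feats.transpose(1, 2))  # -> (batch, out_dim, time / 4)
        return x.transpose(1, 2)              # -> (batch, time / 4, out_dim)

class ToySpeechEncoder(nn.Module):
    """Subsampler followed by a standard Transformer encoder stack."""
    def __init__(self, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.subsample = ConvSubsampler(out_dim=d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, feats):
        return self.encoder(self.subsample(feats))

# A 3-second utterance at 100 frames/s with 80-dim filterbank features.
enc = ToySpeechEncoder()
out = enc(torch.randn(1, 300, 80))
print(out.shape)  # torch.Size([1, 75, 256])
```

Subsampling is the key design choice: acoustic frame sequences are far longer than text sequences, so reducing the time axis before self-attention keeps the quadratic attention cost manageable.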
The system covers diverse speech processing tasks, including end-to-end ASR and ST. It also incorporates the latest advances, such as the Connectionist Temporal Classification (CTC) criterion for ASR (illustrated below) and multiple online processing policies for simultaneous ST.
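The CTC criterion lets an ASR model be trained without frame-level alignments: it sums over all frame-level label sequences that collapse to the target transcript, with a reserved blank symbol absorbing unlabeled frames. A minimal sketch using PyTorch's standard CTC implementation, with illustrative shapes and vocabulary size:

```python
import torch
import torch.nn as nn

# CTC marginalizes over all frame-level alignments that collapse to the
# target label sequence; index 0 is reserved for the blank symbol.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C = 75, 2, 32  # encoder frames, batch size, vocab size (incl. blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 10))      # label ids; blank (0) excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the encoder's log-probabilities
print(loss.item())
```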
Data Handling and Processing
fairseq S2T provides a complete pipeline for handling and pre-processing data. The toolkit extracts Kaldi-compliant speech features using PyKaldi or torchaudio, supports data augmentation techniques such as SpecAugment, and offers an interface for user-defined data transforms. Multiple tokenization strategies, including SentencePiece, enhance its adaptability to different languages and dialects. The sketch below shows what this feature pipeline looks like standalone.
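As a rough standalone equivalent of that pipeline, the following sketch extracts Kaldi-compliant log mel-filterbank features with torchaudio and applies SpecAugment-style frequency and time masking. The file path is a placeholder, and the mask widths are illustrative assumptions rather than the paper's exact augmentation policy.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi
from torchaudio.transforms import FrequencyMasking, TimeMasking

# Load a 16 kHz utterance ("utt.wav" is a placeholder path).
waveform, sample_rate = torchaudio.load("utt.wav")

# Kaldi-compliant 80-dim log mel-filterbank features.
feats = kaldi.fbank(waveform, num_mel_bins=80,
                    sample_frequency=sample_rate)  # (frames, 80)

# SpecAugment-style masking: zero out random frequency and time bands.
# Mask widths here are illustrative, not the paper's settings.
spec = feats.t().unsqueeze(0)                      # (1, freq, frames)
spec = FrequencyMasking(freq_mask_param=27)(spec)
spec = TimeMasking(time_mask_param=100)(spec)
feats = spec.squeeze(0).t()                        # back to (frames, 80)
```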
Evaluation and Visualization
The toolkit includes standard metrics: WER for ASR, and BLEU and chrF for ST and MT. Visualization features such as integration with TensorBoard and VizSeq allow for in-depth error analysis and performance monitoring, making the framework well suited to research and development. For reference, WER is computed as shown below.
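WER is the word-level edit distance between hypothesis and reference, normalized by reference length. A self-contained sketch using standard Levenshtein dynamic programming (for BLEU and chrF one would typically reach for a library such as sacreBLEU instead):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    via standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat",
                      "the cat sat on mat"))  # 1 deletion / 6 words ~ 0.167
```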
Experimental Results
Experiments are conducted on prominent benchmarks, including LibriSpeech for ASR and MuST-C and CoVoST 2 for ST, and show competitive performance. Notably, the Conformer-based wav2vec 2.0 model achieves state-of-the-art results on LibriSpeech. The paper also examines multilingually trained models, which outperform their bilingual counterparts and demonstrate the potential of multilingual modeling in low-resource settings.
Implications and Future Work
fairseq S2T streamlines the development and evaluation of S2T models. By supporting multilingual training and extensible architectures, it lends itself to applications across diverse linguistic environments. Potential future enhancements include further exploration of self-supervised speech features and broader support for language tasks under diverse acoustic conditions.
Overall, fairseq S2T is a substantive contribution to speech-to-text processing: a comprehensive, versatile toolkit and a key resource for researchers working on speech recognition and translation.