Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq

Published 25 May 2018 in cs.CL (arXiv:1805.10387v2)

Abstract: We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training. Benchmarks on machine translation and speech recognition tasks show that models built using OpenSeq2Seq give state-of-the-art performance at 1.5-3x less training time. OpenSeq2Seq currently provides building blocks for models that solve a wide range of tasks including neural machine translation, automatic speech recognition, and speech synthesis.


Summary

  • The paper introduces a versatile toolkit that uses mixed-precision and distributed processing to accelerate NLP and ASR training without compromising accuracy.
  • It demonstrates training-time reductions of up to 3x on neural machine translation and up to 3.6x on speech recognition while maintaining competitive performance metrics.
  • It employs a modular TensorFlow architecture with Horovod integration, enabling flexible model construction and efficient multi-GPU, multi-node distributed training.


The paper "Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq" introduces OpenSeq2Seq, a TensorFlow-based toolkit for efficiently training sequence-to-sequence models. The authors emphasize the toolkit's flexibility: it offers distributed and mixed-precision training aimed at improving computational efficiency on tasks such as neural machine translation (NMT) and automatic speech recognition (ASR).

Key Features of OpenSeq2Seq

OpenSeq2Seq is distinguished by several important features:

  1. Modular Architecture: The toolkit provides a flexible framework where models can be easily constructed by combining various components. Core classes like DataLayer, Model, Encoder, Decoder, and Loss allow users to innovate by creating new models or customizing existing ones. This architecture caters to a range of tasks, including but not limited to image classification and sentiment analysis.
  2. Mixed-Precision Training: Utilizing Tensor Cores on NVIDIA's Volta and Turing GPUs, the toolkit supports mixed-precision training. Computations are performed in FP16 while results are accumulated in FP32, trading reduced precision for higher computational throughput and lower memory consumption. Mixed precision reduces training time by up to 3x without compromising model accuracy.
  3. Distributed Training with Horovod: OpenSeq2Seq integrates the Horovod library, enabling efficient distributed training across multiple GPUs and nodes. This is particularly beneficial for large-scale models, enhancing scalability and reducing training time compared to traditional parameter server-based methods.
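The modular design described in item 1 can be sketched in framework-agnostic Python. The class names Model, Encoder, Decoder, and Loss come from the paper, but the method signatures and the toy components below are illustrative assumptions, not OpenSeq2Seq's actual API:

```python
# Minimal sketch of the modular architecture: interchangeable blocks
# composed into one trainable model. Signatures are illustrative only.

class Encoder:
    def encode(self, inputs):
        raise NotImplementedError

class Decoder:
    def decode(self, encoder_state):
        raise NotImplementedError

class Loss:
    def compute(self, predictions, targets):
        raise NotImplementedError

class Model:
    """Composes an encoder, a decoder, and a loss into one model."""
    def __init__(self, encoder, decoder, loss):
        self.encoder = encoder
        self.decoder = decoder
        self.loss = loss

    def forward(self, inputs, targets):
        state = self.encoder.encode(inputs)
        predictions = self.decoder.decode(state)
        return self.loss.compute(predictions, targets)

# Swapping components changes the task without touching Model:
class IdentityEncoder(Encoder):
    def encode(self, inputs):
        return inputs

class IdentityDecoder(Decoder):
    def decode(self, state):
        return state

class L1Loss(Loss):
    def compute(self, predictions, targets):
        return sum(abs(p - t) for p, t in zip(predictions, targets))

model = Model(IdentityEncoder(), IdentityDecoder(), L1Loss())
print(model.forward([1.0, 2.0], [1.5, 2.0]))  # 0.5
```

Replacing any one component (say, a convolutional encoder for a recurrent one) leaves the rest of the pipeline unchanged, which is the point of the design.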

Numerical Results and Performance

The research presents empirical results showing significant speedups from mixed-precision training. For instance, on NMT tasks using the WMT 2016 dataset, models such as GNMT, ConvS2S, and Transformer achieve robust BLEU scores with noticeably shorter training times, confirming the efficiency of the mixed-precision approach.

In ASR, models such as Deep Speech 2 and Wave2Letter+ trained on the LibriSpeech dataset achieve competitive word error rates while training up to 3.6x faster. The implementation ensures that mixed-precision training maintains convergence parity with traditional FP32 training, a critical property evidenced by comparable training loss curves.
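One ingredient behind this convergence parity is loss scaling: gradients are scaled up before the FP16 cast so small values do not flush to zero, then unscaled in FP32 when updating a master copy of the weights. The sketch below illustrates the idea with NumPy; the gradient magnitude and the scaling constant (1024) are assumptions chosen to make the effect visible, not values from the paper:

```python
import numpy as np

# Illustrative loss-scaling sketch. A gradient of 1e-8 is below FP16's
# smallest subnormal (~6e-8), so a naive FP16 cast loses it entirely.

def sgd_step_mixed(master_w, grad, lr=0.01, loss_scale=1024.0):
    """One SGD step keeping an FP32 master copy of the weights."""
    # Scale up before casting to FP16 so small gradients survive...
    grad_fp16 = (grad * loss_scale).astype(np.float16)
    # ...then unscale in FP32 and apply the update to the master weights.
    unscaled = grad_fp16.astype(np.float32) / np.float32(loss_scale)
    return master_w - lr * unscaled

tiny_grad = np.full(4, 1e-8, dtype=np.float32)
print(tiny_grad.astype(np.float16))  # naive cast: flushes to zeros
print(sgd_step_mixed(np.zeros(4, dtype=np.float32), tiny_grad))  # nonzero update
```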

Practical and Theoretical Implications

The practical implications of OpenSeq2Seq lie in its ability to alleviate computational resource constraints, allowing practitioners to train larger, more complex models faster and with less memory. By incorporating mixed-precision training, the toolkit enables broader accessibility to advanced model architectures, facilitating rapid experimentation and deployment.

Theoretically, this work contributes to the understanding of mixed-precision arithmetic's viability in maintaining model accuracy and convergence, opening avenues for further research in other machine learning domains and hardware optimizations.
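The need for FP32 accumulation that underlies this viability can be seen in a small NumPy experiment (the choice of summing 4096 ones is an illustrative assumption, not an experiment from the paper):

```python
import numpy as np

# Why accumulate in FP32: summing 4096 ones. A pure-FP16 running total
# stalls at 2048, because the gap between adjacent FP16 values there is 2,
# so adding 1 rounds back to the same number. Accumulating in FP32
# (as Tensor Cores do) avoids this failure mode.

ones = np.ones(4096, dtype=np.float16)

acc16 = np.float16(0.0)
for x in ones:
    acc16 = np.float16(acc16 + x)

acc32 = np.float32(0.0)
for x in ones:
    acc32 += np.float32(x)

print(acc16, acc32)  # the FP16 total stalls; the FP32 total is exact
```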

Future Directions

The development of OpenSeq2Seq sets the stage for ongoing enhancements in its modular capabilities. Future expansions could incorporate additional encoders and decoders, extending its applicability to tasks like text classification and image-to-text conversions. As computational hardware evolves, further optimizations for emerging architectures are anticipated, potentially integrating advancements in GPU technologies and distributed computing frameworks.

Overall, OpenSeq2Seq exemplifies efficient model training through its strategic use of mixed-precision and distributed training methodologies, positioning itself as a valuable tool for researchers and developers in NLP and speech recognition.
