- The paper introduces QuartzNet, a novel ASR model that uses 1D time-channel separable convolutions to reduce parameters to under 20 million while maintaining competitive performance.
- The QuartzNet-15x5 configuration achieves 3.90% WER on LibriSpeech test-clean and 11.28% on test-other with a 6-gram language model.
- The reduced computational demands of QuartzNet enable efficient ASR deployment on resource-constrained devices without sacrificing accuracy.
An Examination of QUARTZNET: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
In the domain of automatic speech recognition (ASR), the drive toward models that balance high accuracy with computational efficiency is of prime interest. The paper "QUARTZNET: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions" presents a significant contribution in this respect by proposing an innovative neural network architecture geared towards achieving state-of-the-art performance with reduced parameter count and computational demand.
Core Contributions and Architecture
The central contribution of this research lies in the introduction of QuartzNet, a deep neural network model employing 1D time-channel separable convolutions. This model architecture draws influence from the Jasper architecture, but innovates by incorporating depthwise separable convolutions, thereby substantially reducing the number of parameters. Each basic block in QuartzNet consists of separable convolutional layers, batch normalization, and ReLU layers connected through residual pathways, culminating in a model trained with Connectionist Temporal Classification (CTC) loss.
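To make the key operation concrete, the following is a minimal NumPy sketch of a 1D time-channel separable convolution: a depthwise step that convolves each channel with its own temporal filter, followed by a pointwise (1x1) step that mixes information across channels. The function name and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def time_channel_separable_conv(x, depthwise_w, pointwise_w):
    """Sketch of one 1D time-channel separable convolution.

    x:           (channels_in, time)          input feature map
    depthwise_w: (channels_in, kernel)        one temporal filter per channel
    pointwise_w: (channels_out, channels_in)  1x1 mixing across channels
    """
    c_in, t = x.shape
    k = depthwise_w.shape[1]
    pad = k // 2  # "same" padding for odd kernel sizes

    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise step: each channel is convolved only with its own
    # temporal kernel (kernel reversed so np.convolve cross-correlates).
    depth_out = np.stack([
        np.convolve(xp[c], depthwise_w[c][::-1], mode="valid")
        for c in range(c_in)
    ])
    # Pointwise step: a 1x1 convolution mixes information across channels.
    return pointwise_w @ depth_out
```

In the full model this operation would be followed by batch normalization and ReLU inside each residual block, with CTC loss applied to the final outputs.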
QuartzNet markedly reduces model complexity while maintaining competitive performance. Notably, the model achieves results comparable to existing ASR models on the LibriSpeech and Wall Street Journal datasets with fewer than 20 million parameters. Its 1D time-channel separable convolutions decouple the convolution operation across the time and channel dimensions, which is the source of the efficiency gain.
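The parameter savings can be illustrated with simple arithmetic: a standard 1D convolution with kernel size K, C_in input channels, and C_out output channels has K * C_in * C_out weights, whereas the separable factorization needs only K * C_in (depthwise) plus C_in * C_out (pointwise). The layer sizes below are hypothetical, chosen only to show the scale of the reduction.

```python
def standard_conv1d_params(c_in, c_out, k):
    # One length-k filter per (input channel, output channel) pair.
    return k * c_in * c_out

def separable_conv1d_params(c_in, c_out, k):
    # Depthwise: one length-k temporal filter per input channel,
    # pointwise: a 1x1 convolution mixing channels.
    return k * c_in + c_in * c_out

# Hypothetical layer: 256 -> 256 channels, kernel size 33.
c_in, c_out, k = 256, 256, 33
std = standard_conv1d_params(c_in, c_out, k)   # 2,162,688 weights
sep = separable_conv1d_params(c_in, c_out, k)  # 73,984 weights
print(f"standard: {std:,}  separable: {sep:,}  ratio: {std / sep:.1f}x")
```

For this layer the separable form uses roughly 29x fewer weights, which compounds across a deep stack of blocks into the sub-20-million-parameter total.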
Numerical Results and Comparisons
QuartzNet demonstrates strong performance on key ASR benchmarks at a reduced model size. The QuartzNet-15x5 configuration, for instance, achieves word error rates (WER) that rival larger state-of-the-art models while containing only 18.9 million parameters. On LibriSpeech, the model reaches WERs of 3.90% and 11.28% on the test-clean and test-other subsets, respectively, when decoding with a 6-gram language model. Using a neural Transformer-XL (T-XL) language model pushes the WERs down further, to 2.69% and 7.25%.
Practical Implications and Theoretical Insights
The implications of QuartzNet's approach extend to both deployment and theoretical aspects of model design. Practically, the reduced parameter footprint makes ASR systems more feasible to deploy in resource-constrained environments, such as mobile and embedded devices. Theoretically, QuartzNet reinforces the trend toward architecturally efficient models that achieve high accuracy despite smaller sizes, underscoring the opportunities that depthwise separable convolutions present.
Future Directions and Speculations
The paper hints at intriguing future directions, including the potential integration of QuartzNet's encoder with attention-based decoders. Such combinations could leverage the encoder's efficiency while capturing more complex hierarchical dependencies in speech data through sophisticated decoding mechanisms. Additionally, continued explorations into grouped convolutions and optimization techniques may provide avenues for further compression and performance gains.
Overall, the QuartzNet model marks a significant stride in ASR by demonstrating that high accuracy and low computational demands need not be mutually exclusive, paving the way for more efficient acoustic model designs in the field.