QuartzNet: Efficient ASR with 1D Convolutions

Updated 28 February 2026

QuartzNet is a family of end-to-end convolutional neural networks that use 1D time–channel separable convolutions to achieve efficient ASR with significantly fewer parameters.
The architecture employs residual blocks with depthwise and pointwise convolutions alongside robust quantization techniques for fast, memory-efficient inference on edge devices.
QuartzNet supports transfer learning for adapting to diverse speech tasks, including paralinguistic attribute inference such as age and gender estimation, while remaining competitive on standard ASR benchmarks.

QuartzNet is a family of end-to-end convolutional neural network models for automatic speech recognition (ASR) characterized by the use of deep stacks of 1D time–channel separable convolutions and efficient parameterization, enabling state-of-the-art ASR with an order of magnitude fewer parameters than traditional models. Originally introduced for ASR tasks, it has also been adapted as a generic convolutional feature extractor for paralinguistic speech attribute inference, such as age and gender estimation. The architecture is amenable to efficient quantization techniques, delivering fast and memory-efficient inference suitable for deployment on edge devices while maintaining strong recognition accuracy (Kriman et al., 2019, Kwasny et al., 2020, Kim et al., 2021).

1. Architectural Design and Core Components

QuartzNet is defined by the exclusive use of 1D time–channel separable convolutions, a variant of depthwise–separable convolution, within residual block structures. The canonical configurations are denoted as “QuartzNet‑ $N\times R$ ”, where $N$ is the number of residual blocks, each composed of $R$ repeated separable convolutional modules.

A QuartzNet block consists of:

Depthwise 1D convolution: Operates on each channel separately with kernel length $K$ , contributing $K\cdot C_{in}$ weights.
Pointwise (1x1) convolution: Fuses information across channels, contributing $C_{in}\cdot C_{out}$ weights.
Batch normalization (channel-wise): Stabilizes training and supports robust quantization.
ReLU activation: Provides non-linearity and, in quantized deployments, serves as the sole non-linear operation.

Each block is wrapped with an identity skip connection. When an entire block is repeated $S$ times (a block-level repeat, distinct from the $R$ module repeats inside a block), each repetition maintains its own residual link. The input feature extractor is typically a mel-spectrogram frontend, either fixed or trainable. The output is mapped to character probabilities using a final 1×1 convolution and trained with Connectionist Temporal Classification (CTC) loss (Kriman et al., 2019, Kim et al., 2021).
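The block structure described above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, batch normalization is omitted, zero "same" padding with an odd kernel length is assumed, and the identity skip assumes the block keeps the channel count unchanged.

```python
def depthwise_conv1d(x, dw):
    """x: [C][T] features, dw: [C][K] one kernel per channel (stride 1, zero padding)."""
    k = len(dw[0])
    pad = k // 2  # assumes odd K so the output keeps length T
    out = []
    for chan, kernel in zip(x, dw):
        padded = [0.0] * pad + list(chan) + [0.0] * pad
        out.append([
            sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(chan))
        ])
    return out


def pointwise_conv1d(x, pw):
    """x: [C_in][T], pw: [C_out][C_in]; a 1x1 convolution mixes channels per time step."""
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in pw
    ]


def relu(x):
    return [[max(0.0, v) for v in chan] for chan in x]


def separable_module(x, dw, pw):
    """Depthwise then pointwise convolution, followed by ReLU (batch norm omitted)."""
    return relu(pointwise_conv1d(depthwise_conv1d(x, dw), pw))


def residual_block(x, modules):
    """Apply R separable modules in sequence, then add the identity skip connection."""
    h = x
    for dw, pw in modules:
        h = separable_module(h, dw, pw)
    return [[a + b for a, b in zip(h_chan, x_chan)] for h_chan, x_chan in zip(h, x)]


# One channel, kernel of ones: each output frame sums its three-frame neighborhood.
print(depthwise_conv1d([[1.0, 2.0, 3.0]], [[1.0, 1.0, 1.0]]))  # [[3.0, 6.0, 5.0]]
```

The depthwise step never mixes channels and the pointwise step never mixes time steps, which is what allows the two weight sets to be stored separately.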

Representative Configurations

Model	Blocks ( $N$ )	Repeats per Block ( $R$ )	Parameters (M)
QuartzNet-5×5	5	5	6.7
QuartzNet-10×5	10	5	12.8
QuartzNet-15×5	15	5	18.9

The factorized convolutional structure achieves parameter reduction from $K\cdot C_{in}\cdot C_{out}$ (standard 1D convolution) to $K\cdot C_{in} + C_{in}\cdot C_{out}$ per module, supporting the use of long temporal kernels (up to 75 frames) without incurring excessive computational overhead (Kriman et al., 2019).
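To make the saving concrete, the two weight counts can be compared directly. The kernel length and channel width below are illustrative values chosen for this sketch, not a specific layer from the paper, and biases and batch-norm parameters are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard 1D convolution: K * C_in * C_out."""
    return k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a time-channel separable convolution: K * C_in + C_in * C_out."""
    return k * c_in + c_in * c_out


# Illustrative (assumed) layer shape: K = 33, C_in = C_out = 256.
k, c_in, c_out = 33, 256, 256
full = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(full)       # 2162688
print(separable)  # 73984
print(full / separable)
```

For this shape the separable module needs roughly 29 times fewer weights. The depthwise term grows only linearly in $K$ , which is why long kernels stay cheap.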

2. Training Paradigms and Optimization

QuartzNet models are trained end-to-end with CTC loss, which maximizes the probability of all valid alignments between the input feature sequence and the target symbol sequence:

$P(y|X) = \sum_{\pi\in B^{-1}(y)} \prod_{t=1}^T P(\pi_t|X),$

where $B(\cdot)$ is the mapping that collapses repeated symbols and removes blanks (Kriman et al., 2019).
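For very short sequences the sum over alignments can be evaluated by brute force, which makes the role of $B$ explicit. The sketch below is for illustration only: the names are hypothetical, `_` stands in for the blank symbol, and enumeration costs exponential time in $T$ (practical CTC implementations use a forward-backward dynamic program instead).

```python
from itertools import product

BLANK = "_"  # placeholder blank symbol for this sketch


def collapse(path):
    """B: merge consecutive repeated symbols, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_probability(probs, target):
    """Sum prod_t P(pi_t | X) over every path pi with B(pi) == target.

    probs: one dict per time step mapping symbol -> probability.
    """
    symbols = list(probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]
            total += p
    return total


# Two frames, alphabet {a, blank}: the paths "aa", "a_" and "_a" all collapse to "a".
frames = [{"a": 0.6, BLANK: 0.4}, {"a": 0.6, BLANK: 0.4}]
print(ctc_probability(frames, "a"))  # 0.36 + 0.24 + 0.24, approximately 0.84
```

Note that `collapse("a_a")` yields `"aa"`: a blank between two identical symbols is what lets CTC emit doubled characters.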

Optimization utilizes the NovoGrad optimizer with weight decay. The learning rate follows a cosine-annealing schedule, preceded by a linear warm-up phase of 1,000–10,000 steps. Mixed-precision training (FP16/FP32) accelerates convergence. Input-level data augmentation includes 10% speed perturbation and SpecCutout or related spectrogram masking (Kriman et al., 2019, Kwasny et al., 2020).
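The shape of such a schedule can be sketched as follows. The function name, the peak learning rate, and the step counts are assumptions made for this example; the source only states that a linear warm-up precedes cosine annealing.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches exactly peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr once training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 1,000 warm-up steps out of 10,000 total.
for step in (0, 999, 1000, 5500, 10000):
    print(step, learning_rate(step, peak_lr=0.01, warmup_steps=1000, total_steps=10000))
```

The rate rises to the peak at the end of warm-up, passes through half the peak at the midpoint of the annealing phase, and reaches `min_lr` at the final step.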

For paralinguistic transfer tasks (e.g., age/gender estimation), a two-stage transfer learning scheme is employed: pretraining the QuartzNet embedder on large-scale speaker-ID tasks (VoxCeleb1), then further pretraining or fine-tuning on target attributes with multitask objectives (Kwasny et al., 2020).

3. Quantization and Deployment Efficiency

QuartzNet can be quantized to enable integer-only inference, which is critical for deployment on edge accelerators. A uniform, symmetric quantization scheme is applied to both weights and activations using per-tensor scaling; because the scheme is symmetric, the zero-point is fixed at 0. With bit width $b$ and integer range $[q_{\min}, q_{\max}]$ :

Weight quantization: $w_q = \mathrm{clamp}\bigl(\mathrm{round}(W_f/s_w), q_{\min}, q_{\max}\bigr)$ , $s_w = \alpha_w / (2^{b-1}-1)$ with $\alpha_w = \max|W_f|$ .
Activation quantization: $a_q = \mathrm{clamp}\bigl(\mathrm{round}(A_f/s_a), q_{\min}, q_{\max}\bigr)$ , $s_a = \alpha_a / (2^{b-1}-1)$ with $\alpha_a$ via calibration.
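The two formulas share the same arithmetic and differ only in how the clipping value $\alpha$ is obtained. A minimal sketch, assuming $q_{\max} = 2^{b-1}-1$ and $q_{\min} = -q_{\max}$ (the source does not spell out the integer range) and using hypothetical function names:

```python
def symmetric_quantize(values, bits=8, alpha=None):
    """Uniform symmetric quantization with one scale for the whole tensor.

    alpha=None uses max|value| (the weight case); pass a calibrated
    alpha explicitly for activations.
    """
    q_max = 2 ** (bits - 1) - 1
    q_min = -q_max  # assumed symmetric integer range
    if alpha is None:
        alpha = max(abs(v) for v in values)
    scale = alpha / q_max if alpha > 0 else 1.0  # guard against an all-zero tensor
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    quantized = [max(q_min, min(q_max, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.3, 0.0, 1.0]
q, s = symmetric_quantize(weights)
print(q)  # [-127, 38, 0, 127]
print(dequantize(q, s))

# Activations: values beyond the calibrated alpha saturate at the clamp.
print(symmetric_quantize([0.25, 1.0], alpha=0.5)[0])  # [64, 127]
```

For values inside $[-\alpha, \alpha]$ the round-trip error is at most half a quantization step, $s/2$ ; values outside that range are clipped.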

A zero-shot calibration procedure is utilized to determine activation clipping ranges without requiring real speech data. Synthetic mel-spectrogram “calibration” inputs are generated by minimizing the KL divergence between synthetic and observed batch-norm statistics. BatchNorm folding is performed before quantization, yielding an inference pipeline that consists only of integer Conv1D, bias addition, ReLU, and clamping (Kim et al., 2021).
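BatchNorm folding relies on the standard identity that, at inference time, batch normalization is a per-channel affine map and can therefore be absorbed into the preceding convolution's weights and bias. The sketch below shows that identity in floating point, before any quantization; the function name and argument layout are assumptions for this example, not the paper's code.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time batch norm into the preceding convolution.

    weights: [C_out][n] flattened kernel per output channel.
    bias, gamma, beta, mean, var: per-output-channel lists of length C_out.
    Returns (folded_weights, folded_bias) such that
    conv(x; folded) == batchnorm(conv(x; weights, bias)).
    """
    folded_w, folded_b = [], []
    for w_row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        factor = g / math.sqrt(v + eps)
        folded_w.append([w * factor for w in w_row])
        folded_b.append((b - m) * factor + bt)
    return folded_w, folded_b


# Single output channel fed by two input values (a pointwise convolution).
w, b = [[1.0, 2.0]], [0.5]
gamma, beta, mean, var = [2.0], [0.1], [1.0], [3.0]
x = [3.0, 4.0]

conv_out = sum(wi * xi for wi, xi in zip(w[0], x)) + b[0]
reference = gamma[0] * (conv_out - mean[0]) / math.sqrt(var[0] + 1e-5) + beta[0]

fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var)
folded = sum(wi * xi for wi, xi in zip(fw[0], x)) + fb[0]
print(abs(reference - folded) < 1e-9)  # True
```

After folding, only the convolution weights and a bias remain per layer, so those are the only tensors that need to be quantized.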

Edge deployment characteristics:

INT8 QuartzNet-15×5 achieves 4× model size reduction (73.8 MB to 18.5 MB) and 2.35× end-to-end inference speedup on NVIDIA T4 GPUs, with <0.3% absolute WER degradation on LibriSpeech test-other.
All computation, including non-linearities and sequence decoding, is realized with integer arithmetic and low-order polynomial approximations.
The quantized model is compatible with common inference frameworks such as TensorRT, TVM, and ONNX Runtime (Kim et al., 2021).

4. Performance Benchmarks and Transferability

On standard ASR corpora, QuartzNet is competitive with much larger models. For instance, QuartzNet-15×5 (18.9 M parameters) achieves 2.96%/8.07% WER on LibriSpeech test-clean/test-other (with a 6-gram LM). This outperforms wav2letter++ (208 M parameters, 3.26%/10.47% WER) and comes within about 0.25 points absolute of Jasper-DR-10×5 (333 M parameters, 2.84%/7.84% WER) despite using far fewer parameters (Kriman et al., 2019).

On Wall Street Journal (WSJ), QuartzNet-5×3 (6.4 M parameters) delivers 5.8%/4.5% WER (eval92) with a T-XL language model, compared to wav2letter++ (4.1% WER, 17 M parameters). Smaller QuartzNet variants substantially outperform larger scratch-trained baselines after moderate fine-tuning (e.g., 80 hours on WSJ) (Kriman et al., 2019).

For speaker attribute estimation, replacing multilayer-TDNN with a QuartzNet embedder and adopting staged pretraining yields state-of-the-art mean absolute error (MAE) and root mean squared error (RMSE) in age estimation, with overall gender classification accuracy of 99.6% on TIMIT TEST data (Kwasny et al., 2020).

5. Applications and Use Cases

QuartzNet’s primary application is acoustic modeling for large-vocabulary continuous speech recognition, where its parameter efficiency and accuracy make it suitable for server-scale as well as edge deployment. Its convolutional embedding layers have also been used in x-vector frameworks for non-ASR speech tasks, such as age and gender classification (Kwasny et al., 2020).

The quantized variant is optimized for real-time, low-power, and privacy-sensitive inference on edge devices and accelerators (e.g., Arm Cortex-M, Edge-TPU, or integer Tensor Cores), enabling offline speech recognition and paralinguistic analysis without accessing raw audio data at quantization time (Kim et al., 2021).

6. Model Extensibility and Transfer Learning

QuartzNet’s design supports straightforward transfer learning via fine-tuning on new domains or adapting to novel tasks. Pretraining on large, diverse corpora (LibriSpeech, Mozilla Common Voice, VoxCeleb) enables efficient adaptation with limited target-domain data. The encoder can be reused for tasks beyond ASR by integrating additional output heads for multitask learning—e.g., supplementary dense layers for regression or multi-class classification (Kwasny et al., 2020, Kriman et al., 2019).

A plausible implication is that depthwise-separable convolutional stacks serve as generic, lightweight feature extractors for spoken language and paralinguistic characteristics, provided appropriate transfer learning strategies are employed.

7. Significance and Impact

QuartzNet advances the efficiency frontier of convolutional sequence models for ASR by employing time–channel separable convolutions and residual architecture, enabling deep networks with practical memory and compute costs (Kriman et al., 2019). Its applicability to both ASR and related speech attribute tasks, together with strong quantization support allowing for pure-integer inference without real data, positions QuartzNet as a prominent backbone for research and commercial speech applications requiring deployment efficiency (Kim et al., 2021, Kwasny et al., 2020).


References

Kriman et al. (2019). QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions.

Kwasny et al. (2020). Joint gender and age estimation based on speech signals using x-vectors and transfer learning.

Kim et al. (2021). Integer-only Zero-shot Quantization for Efficient Speech Recognition.
