Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition (2206.15408v1)
Abstract: We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators. Our method is inspired by Lloyd-Max compression theory, with practical adaptations to keep the computational overhead during training feasible. With the quantization centroids derived from a 32-bit baseline, we augment the training loss with a Multi-Regional Absolute Cosine (MRACos) regularizer that aggregates weights towards their nearest centroid, effectively acting as a pseudo compressor. Additionally, a periodically invoked hard compressor is introduced to improve the convergence rate by emulating runtime model weight quantization. We apply S8BQAT to speech recognition tasks using the Recurrent Neural Network Transducer (RNN-T) architecture. With S8BQAT, we are able to increase the model parameter size to reduce the word error rate by 4-16% relative, while still improving latency by 5%.
- Kai Zhen (18 papers)
- Hieu Duy Nguyen (11 papers)
- Raviteja Chinta (2 papers)
- Nathan Susanj (12 papers)
- Athanasios Mouchtaris (31 papers)
- Tariq Afzal (5 papers)
- Ariya Rastrow (55 papers)
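
The abstract describes the S8BQAT recipe at a high level: derive quantization centroids from a 32-bit baseline, add a soft regularizer (MRACos) that pulls weights toward their nearest centroid during training, and periodically hard-snap weights to the centroids to emulate runtime quantization. The sketch below illustrates only that general structure under stated assumptions: it substitutes a plain nearest-centroid L1 penalty for the paper's Multi-Regional Absolute Cosine formulation, assumes `centroids` is a 1-D tensor of sub-8-bit values already derived offline, and all other names (`task_loss_fn`, `lam`, `T_hard`) are hypothetical placeholders, not identifiers from the paper.

```python
import torch

def nearest_centroid(w: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Map each weight in the flat tensor `w` to its closest centroid value."""
    # (N, 1) - (1, K) -> (N, K) absolute distances; pick the closest centroid per weight.
    dist = (w.unsqueeze(1) - centroids.unsqueeze(0)).abs()
    return centroids[dist.argmin(dim=1)]

def soft_centroid_penalty(w: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Pseudo-compressor: an L1 pull of each weight toward its nearest centroid.

    Stand-in for the paper's MRACos regularizer; only its role (aggregating
    weights toward centroids) is mirrored here, not the exact functional form.
    """
    return (w - nearest_centroid(w, centroids)).abs().mean()

@torch.no_grad()
def hard_compress_(model: torch.nn.Module, centroids: torch.Tensor) -> None:
    """Periodic hard compressor: snap every weight to its nearest centroid,
    emulating what runtime quantization will do to the deployed model."""
    for p in model.parameters():
        p.copy_(nearest_centroid(p.reshape(-1), centroids).view_as(p))

# Hypothetical training-loop wiring (model, loader, task_loss_fn, opt, lam, T_hard
# are placeholders, not names from the paper):
#
# for step, (x, y) in enumerate(loader):
#     loss = task_loss_fn(model(x), y)
#     for p in model.parameters():
#         loss = loss + lam * soft_centroid_penalty(p.reshape(-1), centroids)
#     opt.zero_grad(); loss.backward(); opt.step()
#     if (step + 1) % T_hard == 0:          # periodically invoked hard compressor
#         hard_compress_(model, centroids)
```

In the paper the centroids come from a Lloyd-Max-style analysis of the 32-bit baseline's weight distribution; in this sketch they are simply taken as given.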