Exploring Sparsity in Recurrent Neural Networks (1704.05119v2)

Published 17 Apr 2017 in cs.LG and cs.CL

Abstract: Recurrent Neural Networks (RNN) are widely used to solve a variety of problems and as the quantity of data and the amount of available compute have increased, so have model sizes. The number of parameters in recent state-of-the-art networks makes them hard to deploy, especially on mobile phones and embedded devices. The challenge is due to both the size of the model and the time it takes to evaluate it. In order to deploy these RNNs efficiently, we propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network. At the end of training, the parameters of the network are sparse while accuracy is still close to the original dense neural network. The network size is reduced by 8x and the time required to train the model remains constant. Additionally, we can prune a larger dense network to achieve better than baseline performance while still reducing the total number of parameters significantly. Pruning RNNs reduces the size of the model and can also help achieve significant inference time speed-up using sparse matrix multiply. Benchmarks show that using our technique model size can be reduced by 90% and speed-up is around 2x to 7x.

Exploring Sparsity in Recurrent Neural Networks

The research paper "Exploring Sparsity in Recurrent Neural Networks" investigates an efficiency technique for Recurrent Neural Networks (RNNs) that prunes weights during the initial training of the network. As available data and compute have grown, so have model sizes, making deployment challenging, especially in memory-constrained environments such as mobile phones and embedded systems. The paper presents a pruning technique that aims to maintain model accuracy while significantly reducing the number of parameters.

The authors focus on alleviating two primary constraints in deploying large RNN models: memory footprint and computational cost. By pruning against a threshold that increases progressively during training, they reach roughly 90% sparsity, reducing model size by nearly 8x while keeping training time essentially unchanged relative to fully dense models. Despite this order-of-magnitude reduction in nonzero parameters, the loss in accuracy compared to the dense baseline remains small. The approach is also computationally promising, offering inference speed-ups of about 2x to 7x by replacing dense matrix multiplications with sparse ones.

Significantly, the pruning technique does not rely on earlier approaches such as approximating the Hessian, nor does it require separate retraining phases, both of which have often hindered scalable deployment due to their added computational cost. Instead, it uses a simple heuristic: weights whose magnitudes fall below a monotonically increasing threshold are set to zero at regular intervals during standard training epochs. Such pruning is particularly practical because it fits seamlessly into current training frameworks and optimizers, making it easy to adopt in existing systems.
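
The threshold schedule is governed by a handful of hyperparameters (a start iteration, a ramp iteration, an end iteration, two slopes, and an evaluation frequency). The sketch below illustrates this kind of ramping threshold in Python; the function and parameter names are paraphrased, and the exact ramp formula and constants should be read as illustrative rather than as the paper's precise schedule.

```python
def pruning_threshold(itr, start_itr, ramp_itr, end_itr,
                      start_slope, ramp_slope, freq):
    """Monotonically increasing magnitude threshold, evaluated every `freq` steps.

    No pruning before start_itr; the threshold grows slowly until ramp_itr,
    then more aggressively until end_itr, after which it is held constant.
    """
    if itr < start_itr:
        return 0.0
    if itr < ramp_itr:
        return start_slope * (itr - start_itr + 1) / freq
    capped = min(itr, end_itr)
    return (start_slope * (ramp_itr - start_itr + 1)
            + ramp_slope * (capped - ramp_itr + 1)) / freq

# Example: the threshold at a few points of a 10,000-iteration run
# (all values below are illustrative, not the paper's settings).
for itr in (1_000, 3_000, 6_000, 9_500):
    print(itr, pruning_threshold(itr, start_itr=2_000, ramp_itr=5_000,
                                 end_itr=9_000, start_slope=1e-3,
                                 ramp_slope=4e-3, freq=100))
```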

Experimental analysis was conducted on roughly 2,100 hours of English speech data, using Deep Speech 2-style models with both vanilla RNN and GRU layers. Pruning a network of the same size costs accuracy: sparse networks show roughly 20% worse character error rate (CER) than their dense counterparts in some configurations. However, starting from a larger dense network and then pruning it yields relative performance equal to or better than the dense baseline. For instance, a sparse model with 3072 hidden units delivered a 3.95% relative improvement over the dense baseline with 1760 units, demonstrating both the method’s efficacy and its potential for strong performance on constrained hardware.

The implementation maintains a binary mask for each weight matrix and a threshold controlled by a small set of hyperparameters; pruning updates are applied at regular intervals without modifying the gradient computation. This ability to cheaply zero out insignificant weights, without slowing down training, is central to the method's effectiveness, yielding substantial reductions in both network size and operational complexity.
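
As a concrete illustration, the following self-contained sketch applies mask-based magnitude pruning inside a toy training loop. The layer size, learning rate, pruning frequency, and the simple linear threshold ramp are all illustrative stand-ins (a fuller ramp schedule is sketched above); only the overall pattern, pruning by magnitude on a schedule, keeping masked weights at zero, and leaving gradients untouched, follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrent weight matrix and its persistent pruning mask (illustrative size).
W = (0.1 * rng.standard_normal((256, 256))).astype(np.float32)
mask = np.ones_like(W, dtype=bool)

PRUNE_FREQ = 100      # how often the mask is refreshed (hypothetical value)
START_ITR = 2_000     # no pruning before this iteration
END_ITR = 9_000       # the threshold stops growing here
FINAL_EPS = 0.17      # final magnitude threshold (chosen so this toy ends near 90% sparsity)
LR = 1e-4

for itr in range(1, 10_001):
    grad_W = rng.standard_normal(W.shape).astype(np.float32)  # stand-in for backprop
    W -= LR * grad_W  # ordinary SGD step; the gradient computation is never modified

    if itr % PRUNE_FREQ == 0:
        # Simple linear ramp from 0 at START_ITR to FINAL_EPS at END_ITR.
        progress = min(max(itr - START_ITR, 0), END_ITR - START_ITR)
        eps = FINAL_EPS * progress / (END_ITR - START_ITR)
        # Weights below the current threshold are pruned for good.
        mask &= np.abs(W) >= eps

    # Re-apply the mask after every update so pruned weights stay at zero.
    W *= mask

print(f"final sparsity: {1.0 - mask.mean():.1%}")
```

In a real implementation the mask would be kept for every recurrent and fully connected weight matrix, and the gradient would come from backpropagation through the RNN rather than the random stand-in used here.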

Beyond these algorithmic results, the paper discusses the practicalities of deploying pruned models, illustrating the compression benefits for storage-limited devices. Notably, a Deep Speech 2 model could be compressed from 268 MB down to 32 MB for specific sparse configurations. This roughly 8x reduction in size, combined with faster sparse inference, is crucial for real-time applications in both server and mobile settings.

Looking forward, this pruning mechanism opens several avenues for exploration. Potential extensions include applying the technique to other network or layer types, such as language models or embedding layers, broadening its utility across diverse neural architectures. Comparisons with L1 regularization as an alternative way to induce sparsity (a minimal example is sketched below), together with improved sparse matrix kernels, could further boost the approach's adaptability and performance.
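
For reference, the L1-regularization baseline mentioned above induces sparsity indirectly, by adding a penalty proportional to the sum of absolute weight values to the training loss rather than by explicitly masking weights. The snippet below is a minimal, hypothetical illustration of that penalty (the regularization strength and matrix shapes are made up); it is not part of the paper's method.

```python
import numpy as np

def l1_penalty(weight_matrices, lam=1e-5):
    """Sparsity-inducing L1 term added to the training loss: lam * sum(|w|).

    `lam` is an illustrative regularization strength; in practice it is tuned
    so that enough weights are driven close to zero to be pruned afterwards.
    """
    return lam * sum(np.abs(W).sum() for W in weight_matrices)

# Example: total loss = task loss + L1 penalty over all weight matrices.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((128, 128)), rng.standard_normal((128, 64))]
task_loss = 1.25  # stand-in for, e.g., a CTC loss value
total_loss = task_loss + l1_penalty(weights)
print(total_loss)
```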

In summary, this paper contributes valuable insights into deploying neural networks in resource-constrained environments, highlighting how model sparsity can balance accuracy and efficiency. It underscores the growing need for carefully optimized models that cater to a wide range of processing capabilities, paving the way for more ubiquitous deep learning deployments.

Authors (4)
  1. Sharan Narang (31 papers)
  2. Erich Elsen (28 papers)
  3. Gregory Diamos (11 papers)
  4. Shubho Sengupta (15 papers)
Citations (297)