meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting (1706.06197v5)

Published 19 Jun 2017 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1-4% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The code is available at https://github.com/lancopku/meProp

Citations (149)

Summary

  • The paper introduces meProp, a selective gradient update method that accelerates backpropagation and improves generalization.
  • It employs a top-k strategy to update only the most significant gradients, reducing unnecessary weight adjustments.
  • Experiments on LSTM, MLP, and various tasks demonstrate up to 69.2x speed improvements while maintaining or boosting accuracy.

Analyzing "meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting"

The paper "meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting" introduces a novel method for optimizing neural network learning by sparsifying the backpropagation process, which is a central yet computationally intensive component of training neural networks. Termed as minimal effort backpropagation (meProp), the method sparsifies the gradient descent process during network training, ostensibly to both speed up computation and potentially reduce model overfitting.

Research Motivation and Proposed Approach

Traditional backpropagation computes full gradients over potentially millions of parameters, incurring substantial computational expense and possibly exacerbating overfitting by over-updating insignificant weights, akin to fitting noise. meProp addresses these inefficiencies by operating on a reduced gradient: during the backward pass, only the top-k elements by magnitude are kept. This selective update approach is proposed to minimize computational cost and enhance generalization by focusing only on the most significant weight changes.

The main issues addressed in the research include:

  1. Selection of Relevant Parameters: meProp employs a top-k search strategy to identify the most critical parameters to update for each stochastic sample; a minimal sketch of this sparsified backward pass follows the list.
  2. Impact on Model Accuracy: Experimental evidence shows that restricting updates to just 1-4% of the weights does not degrade model accuracy and often improves it, likely because of reduced overfitting, akin to the effect of techniques such as dropout.
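
To make the mechanism concrete, below is a minimal NumPy sketch of a meProp-style backward pass for a single fully connected layer. This is an illustration, not the authors' released implementation; the helper names (top_k_mask, meprop_linear_backward) and the layer sizes are assumptions made for the example.

```python
import numpy as np

def top_k_mask(v, k):
    # Keep only the k largest-magnitude entries of v; zero out the rest.
    kept = np.argpartition(np.abs(v), -k)[-k:]
    out = np.zeros_like(v)
    out[kept] = v[kept]
    return out

def linear_forward(W, b, x):
    # Forward propagation is computed as usual: y = W x + b.
    return W @ x + b

def meprop_linear_backward(W, x, dy, k):
    # Sparsify the gradient w.r.t. the layer output, then backpropagate.
    dy_sparse = top_k_mask(dy, k)
    dW = np.outer(dy_sparse, x)   # only k rows of dW are non-zero
    dx = W.T @ dy_sparse          # only k columns of W^T contribute
    return dW, dx

# Toy usage: a 300 -> 500 layer, keeping k = 20 of 500 gradient components (4%),
# i.e. a linear k/n reduction of the backward cost for this layer.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((500, 300)) * 0.01, np.zeros(500)
x = rng.standard_normal(300)
dy = rng.standard_normal(500)     # upstream gradient from the loss
dW, dx = meprop_linear_backward(W, x, dy, k=20)
print(np.count_nonzero(np.any(dW != 0, axis=1)))   # -> 20 rows to update
```

Because only k rows of the weight matrix receive a non-zero gradient, the subsequent parameter update can skip the remaining rows, which is the source of the linear cost reduction described in the abstract.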

Experimental Validation

The paper conducts a comprehensive set of experiments across different models, optimizers, and tasks to validate the efficacy of meProp, including:

  • Models: Long Short-Term Memory (LSTM) networks and Multi-Layer Perceptrons (MLPs).
  • Optimizers: Adam and AdaGrad.
  • Tasks: POS-tagging, dependency parsing, and MNIST image recognition.

Key results highlight that meProp accelerates backpropagation by substantial factors (up to 69.2x), with accuracy improvements reported across tasks. These results indicate that meProp maintains, and often improves, model accuracy while substantially reducing backward-pass computation. Furthermore, when combined with dropout, meProp consistently enhanced performance, suggesting that it addresses a distinct dimension of overfitting.
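
To illustrate how such a sparsified backward step can be slotted into an ordinary training loop, the following PyTorch sketch expresses the top-k selection as an identity function whose backward pass keeps only the k largest-magnitude gradient components. It is a sketch under assumed dimensions, not the authors' released code, and the name TopKGrad is illustrative.

```python
import torch

class TopKGrad(torch.autograd.Function):
    """Identity in the forward pass; in the backward pass, keep only the
    k largest-magnitude gradient components per example (illustrative)."""

    @staticmethod
    def forward(ctx, x, k):
        ctx.k = k
        return x

    @staticmethod
    def backward(ctx, grad_out):
        _, idx = grad_out.abs().topk(ctx.k, dim=-1)
        sparse = torch.zeros_like(grad_out)
        sparse.scatter_(-1, idx, grad_out.gather(-1, idx))
        return sparse, None   # no gradient for k

# One training step of a small MLP with the hidden layer's gradient sparsified.
hidden = torch.nn.Linear(784, 500)
output = torch.nn.Linear(500, 10)
opt = torch.optim.Adam(list(hidden.parameters()) + list(output.parameters()))

x, y = torch.randn(1, 784), torch.tensor([3])
h = torch.relu(TopKGrad.apply(hidden(x), 20))   # keep 20 of 500 (~4%)
loss = torch.nn.functional.cross_entropy(output(h), y)
opt.zero_grad()
loss.backward()
opt.step()
# At most 20 rows of hidden.weight received a non-zero gradient this step.
print((hidden.weight.grad.abs().sum(dim=1) > 0).sum().item())
```

Note that in this dense-autograd sketch the optimizer still visits every row; realizing the reported speedups requires a backward implementation that computes only the selected k rows or columns, as in the released code.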

Implications and Future Directions

The proposed approach has crucial theoretical and practical implications:

  • Theoretically, meProp suggests a revised understanding and handling of gradient information during neural network training, emphasizing the relevance of focusing updates on truly impactful parameters.
  • Practically, it offers an efficient alternative for addressing computational barriers associated with backpropagation, fostering enhanced deployment of deep learning models in resource-constrained environments.

Future work could explore more sophisticated techniques for identifying significant parameters beyond the top-k strategy or investigate integration with other model regularization strategies. Additionally, extending this sparse gradient selection principle to other domains in machine learning could be a productive area of research.

The paper’s findings extend the toolkit available for developing more computationally feasible, and potentially more robust, neural network models, supporting sustainable advances in machine learning practice. The availability of implementation code provides a practical resource for further experimentation and adoption within the research community.
