- The paper introduces a novel local-entropy objective that biases gradient descent towards broad, flat regions, enhancing generalization in deep networks.
- The algorithm nests stochastic gradient Langevin dynamics (SGLD) iterations inside each SGD step to estimate the gradient of the local entropy, effectively probing the loss-surface geometry and smoothing the optimization landscape.
- Experimental results on CNNs and RNNs demonstrate that Entropy-SGD achieves competitive performance with faster convergence and enhanced stability.
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
This paper introduces Entropy-SGD, an optimization algorithm for training deep neural networks that exploits the local geometry of the energy landscape. The key insight is the empirical observation that local minima with low generalization error have Hessians with a large proportion of nearly-zero eigenvalues. Such minima lie in wide, flat valleys of the energy landscape, as opposed to narrow, sharp valleys, which tend to correspond to solutions that generalize poorly.
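To make the flatness criterion concrete, here is a minimal toy sketch (my own illustration, not taken from the paper): for a simple quadratic loss minimized at the origin, a wide, flat valley shows up as a Hessian spectrum dominated by nearly-zero eigenvalues, whereas a sharp minimum has large curvature in every direction. The matrices `H_wide` and `H_sharp` and the 1e-2 threshold are arbitrary choices for illustration.

```python
import numpy as np

# Toy quadratics f(w) = 0.5 * w @ H @ w, both minimized at w = 0, with very
# different curvature profiles: "wide/flat" has mostly near-zero Hessian
# eigenvalues, "sharp" has large curvature in every direction.
H_wide = np.diag([1.0, 1e-3, 1e-4, 1e-5, 1e-6])
H_sharp = np.diag([50.0, 40.0, 30.0, 20.0, 10.0])

for name, H in [("wide/flat", H_wide), ("sharp", H_sharp)]:
    eigvals = np.linalg.eigvalsh(H)        # Hessian eigenvalues at the minimum
    frac_small = np.mean(eigvals < 1e-2)   # fraction of nearly-zero directions
    print(f"{name:9s} eigenvalues: {eigvals}, fraction near zero: {frac_small:.2f}")
```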
Key Contributions
- Local-Entropy-Based Objective Function: The paper proposes a new objective function biased towards these well-generalizing, wide valleys. It replaces the original training loss with the (negative) local entropy, defined as the logarithm of a partition function restricted, via a quadratic coupling, to a neighborhood of the current weights (see the formula following this list).
- Algorithm Description: Conceptually, Entropy-SGD consists of two nested loops of SGD. The inner loop runs stochastic gradient Langevin dynamics to estimate the gradient of the local entropy, which amounts to probing the local geometry and effectively smoothing the energy landscape. The outer loop then updates the network weights along this entropy-biased gradient (a pseudocode sketch follows this list).
- Theoretical Analysis: The paper shows analytically that the modified objective is smoother than the original loss and derives a generalization-error bound via uniform stability. The bound depends on the smoothness of the modified objective and on the stability results of Hardt et al. (2015).
- Experimental Validation: Experiments on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) show that Entropy-SGD achieves comparable or better generalization error than state-of-the-art optimization techniques. For instance, on the CIFAR-10 dataset, the All-CNN-BN network trained with Entropy-SGD reaches an error of 7.81%, comparable to the 7.71% obtained with SGD, in fewer effective epochs. Similarly, on the PTB dataset, the PTB-LSTM network trained with Entropy-SGD reaches a test perplexity of 77.656, versus 78.6 with standard SGD.
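For reference, the local-entropy objective mentioned in the first contribution is typically written as follows (up to constants; here $f$ denotes the original training loss and $\gamma > 0$ the "scope" hyper-parameter controlling the size of the neighborhood):

$$
F(x;\gamma) \;=\; \log \int_{x' \in \mathbb{R}^n} \exp\!\Big( -f(x') \;-\; \tfrac{\gamma}{2}\,\lVert x - x' \rVert_2^2 \Big)\, dx'.
$$

Entropy-SGD minimizes $-F(x;\gamma)$ in place of $f(x)$; its gradient $\nabla(-F)(x) = \gamma\,(x - \langle x' \rangle)$ involves the mean $\langle x' \rangle$ of the Gibbs distribution proportional to the integrand above, which is precisely the quantity the inner Langevin loop estimates.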
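The nested-loop structure described in the second contribution can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' reference implementation: the function name `entropy_sgd_step`, the callable `grad_minibatch`, and the hyper-parameter names `gamma`, `eta`, `eta_prime`, `L`, `eps`, and `alpha` are placeholders introduced here for clarity.

```python
import numpy as np

def entropy_sgd_step(x, grad_minibatch, rng, gamma=1e-4, eta=0.1,
                     eta_prime=0.1, L=20, eps=1e-4, alpha=0.75):
    """One outer update of an Entropy-SGD-style optimizer (illustrative only).

    x              -- current weights as a flat NumPy array
    grad_minibatch -- callable returning a stochastic gradient of the original
                      loss f at a given point (a fresh minibatch per call)
    gamma          -- "scope" of the local entropy (strength of the coupling to x)
    eta, eta_prime -- outer and inner (Langevin) step sizes
    L              -- number of inner Langevin iterations
    eps            -- thermal-noise scale for the Langevin updates
    alpha          -- exponential-averaging weight for the inner mean
    """
    x_prime = x.copy()
    mu = x.copy()  # running estimate of <x'> under the local Gibbs measure
    for _ in range(L):
        # Gradient of f at x' plus the quadratic coupling pulling x' back toward x.
        dx = grad_minibatch(x_prime) - gamma * (x - x_prime)
        # Stochastic gradient Langevin step: a gradient step plus Gaussian noise.
        noise = rng.standard_normal(x.shape)
        x_prime = x_prime - eta_prime * dx + np.sqrt(eta_prime) * eps * noise
        mu = (1.0 - alpha) * mu + alpha * x_prime
    # Outer step along the negative local-entropy gradient, gamma * (x - mu).
    return x - eta * gamma * (x - mu)
```

A caller would supply, for example, `rng = np.random.default_rng(0)` and a `grad_minibatch` closure built from the model and data loader; in practice the same structure would be implemented on top of an autograd framework rather than raw NumPy. The quadratic coupling keeps the Langevin iterates near the current weights, and the outer step moves the weights toward the running average, i.e. in the direction of increasing local entropy.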
Implications
The practical implications of using Entropy-SGD are significant. The algorithm consistently finds solutions in flat regions of the loss surface, which not only helps in obtaining well-generalizable models but also accelerates convergence, particularly in the context of RNNs. This improvement can be attributed to the smoother modified loss function, which mitigates sharp minima and saddle points that typically slow down or trap traditional gradient descent methods.
Theoretical Implications and Future Directions
The theoretical findings imply that ensuring optimization algorithms seek wide valleys can lead to configurations with low generalization error. This reinforces the importance of the local geometric properties of the optimization landscape. Future research might focus on further improving the efficiency of estimating these entropy-based gradients and adapting similar strategies to other optimization paradigms. Moreover, exploring more sophisticated MCMC algorithms to approximate local entropy could potentially yield even faster convergence and better generalization properties.
Conclusion
Entropy-SGD represents a thoughtful integration of statistical-physics principles into deep learning optimization. By biasing the search towards wide valleys in the energy landscape, it encourages deep networks to both converge efficiently and generalize well. The methodological advances and experimental validation presented in this paper deepen the understanding of the intrinsic properties of deep learning models, paving the way for more robust and efficient training protocols.