GradMax: Growing Neural Networks using Gradient Information (2201.05125v3)

Published 13 Jan 2022 in cs.LG and cs.CV

Abstract: The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.

Citations (44)

Summary

  • The paper introduces Gradient Maximizing Growth (GradMax) to optimize network expansion by maximizing the new weights' gradient norm while preserving the network's output.
  • It employs singular value decomposition (SVD) for optimal initialization of outgoing weights, ensuring rapid learning and stable performance across architectures.
  • Empirical results show that GradMax accelerates training in both fully connected and convolutional networks, with further applications in inserting new layers and adapting to transformer models.

GradMax: Efficient Neural Network Growing Using Singular Value Decomposition

Introduction to Neural Network Growing Techniques

Growing neural networks dynamically during training is an attractive way to balance model capacity against computational cost. Existing methods typically expand the architecture either by splitting existing neurons or by adding randomly initialized new neurons, while trying to preserve what the network has already learned. The principal challenge is how to initialize the newly added neurons without compromising prior learning.

GradMax: A Novel Approach

"GradMax: Growing Neural Networks using Gradient Information" introduces a strategic method, termed Gradient Maximizing Growth (GradMax), focusing on enhancing future training dynamics rather than minimizing immediate training loss. The core strategy of GradMax is to add new neurons in a way that maximizes the norm of the gradient of the new weights, under the condition that the network's output remains unchanged by the growth process. This is achieved by setting the incoming weights to zero and using singular value decomposition (SVD) to find the optimal initialization for the outgoing weights.

Mathematical Foundation

The approach is formalized as an optimization problem: maximize the squared Frobenius norm of the gradient with respect to the newly added weights, subject to a norm constraint on the new weights and the requirement that the network's output is preserved.
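In paraphrased notation (introduced here for illustration, not verbatim from the paper), with W_in^new and W_out^new denoting the incoming and outgoing weights of the new neurons and c a norm budget, the growth step can be written roughly as:

```latex
% Growth objective (paraphrased; the notation is illustrative):
% choose the new outgoing weights to maximize the gradient received by the
% zero-initialized incoming weights, under a norm budget c.
\max_{W_{\text{out}}^{\text{new}}}\;
  \left\| \frac{\partial \mathcal{L}}{\partial W_{\text{in}}^{\text{new}}} \right\|_F^2
\quad \text{s.t.}\quad
  \left\| W_{\text{out}}^{\text{new}} \right\|_F \le c,
\qquad
  W_{\text{in}}^{\text{new}} = 0 \;\;\text{(output preservation)}.
```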

GradMax makes this optimization tractable through simplifying assumptions, most notably that the activation function maps zero to zero and has derivative one at the origin. Under these assumptions, the maximization problem is solved in closed form using SVD, which yields the outgoing weights of the new neurons that maximize the gradient norm.
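A minimal sketch of how such a closed-form step might look in code (the matrix M below, built from the next layer's backpropagated gradients and the previous layer's activations, paraphrases the quantity whose SVD the paper uses; names, shapes, and the exact scaling of the norm budget are illustrative assumptions):

```python
import numpy as np

def gradmax_like_outgoing_init(delta_next, h_prev, k, c=1.0):
    """Sketch of an SVD-based initialization for the outgoing weights of k new
    neurons, in the spirit of GradMax (not the authors' exact code).

    delta_next: (N, d_next) backpropagated gradients w.r.t. the next layer's
                pre-activations, one row per training example.
    h_prev:     (N, d_prev) activations of the previous layer.
    k:          number of new neurons to add.
    c:          norm budget for the new outgoing weights (assumed hyperparameter).
    """
    # M aggregates the gradient signal the new neurons could capture:
    # M = sum_i delta_next_i h_prev_i^T, shape (d_next, d_prev).
    M = delta_next.T @ h_prev

    # The top left singular vectors of M are the directions of largest gradient
    # signal; taking the top k (scaled to the budget c) follows the spirit of
    # the paper's closed-form solution.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    W_out_new = (c / np.sqrt(k)) * U[:, :k]        # (d_next, k), one column per new neuron
    W_in_new = np.zeros((k, h_prev.shape[1]))      # zero incoming weights preserve the output
    return W_in_new, W_out_new

# Illustrative usage with random placeholders for the cached quantities:
rng = np.random.default_rng(0)
delta_next = rng.normal(size=(128, 4))   # d_next = 4
h_prev = rng.normal(size=(128, 8))       # d_prev = 8
W_in_new, W_out_new = gradmax_like_outgoing_init(delta_next, h_prev, k=3)
print(W_in_new.shape, W_out_new.shape)   # (3, 8) (4, 3)
```

In practice one would accumulate delta_next and h_prev over a few minibatches before growing; the random placeholders above only serve to show the shapes involved.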

Practical Implications and Theoretical Justifications

The essence of GradMax is the principle that larger gradients lead to faster learning. The method has been tested across a range of fully connected and convolutional architectures, demonstrating accelerated training and improved model performance. It is particularly effective when applied early in training, increasing the model's capacity to learn quickly without requiring extensive architectural redesign.
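The link between gradient magnitude and learning speed can be made explicit with a standard first-order argument (our illustration, not a result specific to the paper): one gradient-descent step reduces the loss, to first order, by the learning rate times the squared gradient norm, so new weights that receive larger gradients contribute a larger immediate decrease in the loss.

```latex
% First-order effect of a single gradient-descent step with learning rate \eta:
\mathcal{L}\!\left(\theta - \eta\,\nabla\mathcal{L}(\theta)\right)
  \;\approx\; \mathcal{L}(\theta) \;-\; \eta\,\bigl\|\nabla\mathcal{L}(\theta)\bigr\|^2 .
```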

Extensions and Future Directions

GradMax is not restricted to adding new neurons to existing layers; it can also guide the insertion of entirely new layers between existing ones, offering a versatile tool for network expansion. Although primarily discussed in the context of fully connected and convolutional layers, the method's principles hold potential for adaptation to other architectures, including transformers.

The introduction of GradMax opens avenues for research in dynamically adjusting network architectures during learning processes. It poses questions about the optimal timing and locations for growth, along with exploring the method's compatibility and effectiveness within broader neural architecture search (NAS) methodologies and in combination with various activation functions and regularizations.

Conclusion

GradMax presents a compelling argument for the growth of neural networks focused on optimizing future training dynamics. By leveraging gradient maximization, facilitated by singular value decomposition, GradMax provides a principled and effective way to expand neural network architectures dynamically. This research contributes a strategic tool to the machine learning arsenal, potentially reducing the computational costs associated with large-scale architectures while retaining, if not enhancing, model performance.
