Magnitude Invariant Parametrizations Improve Hypernetwork Learning (2304.07645v2)
Abstract: Hypernetworks, neural networks that predict the parameters of another neural network, are powerful models that have been successfully used in diverse applications from image generation to multi-task learning. Unfortunately, existing hypernetworks are often challenging to train: training typically converges far more slowly than for non-hypernetwork models, and the rate of convergence can be very sensitive to hyperparameter choices. In this work, we identify a fundamental and previously unidentified problem that contributes to the challenge of training hypernetworks: a magnitude proportionality between the inputs and outputs of the hypernetwork. We demonstrate both analytically and empirically that this proportionality can destabilize optimization, slowing convergence and sometimes preventing any learning. We present a simple solution to this problem using a revised hypernetwork formulation that we call Magnitude Invariant Parametrizations (MIP). We demonstrate the proposed solution on several hypernetwork tasks, where it consistently stabilizes training and achieves faster convergence. Furthermore, we perform a comprehensive ablation study covering activation functions, normalization strategies, input dimensionality, and hypernetwork architecture, and find that MIP improves training in all scenarios. We provide easy-to-use code that can turn existing networks into MIP-based hypernetworks.
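The magnitude proportionality described in the abstract is easy to reproduce with a toy hypernetwork. The sketch below is an illustration written for this summary, not the paper's released code or its MIP formulation: it assumes a bias-free linear hypernetwork (the hypothetical `TinyHypernet`) mapping a scalar hyperparameter to the weights of a small main-network layer, so that scaling the input scales the predicted weights, and hence the main network's response, by the same factor.

```python
# Minimal sketch (assumed setup, not the paper's code) of input/output
# magnitude proportionality in a hypernetwork.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyHypernet(nn.Module):
    """Maps a scalar hyperparameter h to the weights of a 4x4 linear layer."""
    def __init__(self, out_features=4, in_features=4):
        super().__init__()
        self.shape = (out_features, in_features)
        # Bias-free linear generator, so W(h) is exactly proportional to h.
        self.generator = nn.Linear(1, out_features * in_features, bias=False)

    def forward(self, h):
        return self.generator(h).view(self.shape)

hyper = TinyHypernet()
x = torch.randn(4)  # input to the main network

for scale in (0.1, 1.0, 10.0):
    h = torch.tensor([scale])
    W = hyper(h)        # predicted main-network weights
    y = W @ x           # main-network response under those weights
    print(f"|h|={scale:5.1f}  ||W||={W.norm():.3f}  ||y||={y.norm():.3f}")

# The printed norms grow linearly with |h|; gradients inherit the same
# scaling, which is the kind of instability MIP is designed to remove.
```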