Principled Weight Initialization for Hypernetworks (2312.08399v1)

Published 13 Dec 2023 in cs.LG

Abstract: Hypernetworks are meta neural networks that generate weights for a main neural network in an end-to-end differentiable manner. Despite extensive applications ranging from multi-task learning to Bayesian deep learning, the problem of optimizing hypernetworks has not been studied to date. We observe that classical weight initialization methods like Glorot & Bengio (2010) and He et al. (2015), when applied directly on a hypernet, fail to produce weights for the mainnet in the correct scale. We develop principled techniques for weight initialization in hypernets, and show that they lead to more stable mainnet weights, lower training loss, and faster convergence.
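
The abstract's key observation, that a classical fan-in initializer applied directly to the hypernet leaves the generated mainnet weights at the wrong scale, can be illustrated with a small numerical experiment. The NumPy sketch below is only a toy under stated assumptions: the layer sizes, the `he_init` helper, and the variance-rescaling rule for the hypernet's output layer are illustrative choices in the spirit of the paper's idea, not its exact derivation or notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mainnet layer whose weights the hypernet must generate: fan_in -> fan_out.
fan_in_main, fan_out_main = 128, 128
n_weights = fan_in_main * fan_out_main

# Hypernet: embedding e -> ReLU hidden layer h -> flattened mainnet weights W.
d_embed, d_hidden = 64, 128

def he_init(fan_in, shape):
    # Classical He et al. (2015) fan-in initialization: Var = 2 / fan_in.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

# Naive scheme: He-initialize every hypernet layer as if it were an ordinary
# feedforward layer, ignoring what its output will be used for.
H1 = he_init(d_embed, (d_hidden, d_embed))
H2_naive = he_init(d_hidden, (n_weights, d_hidden))

e = rng.normal(0.0, 1.0, size=d_embed)   # layer/task embedding fed to the hypernet
h = np.maximum(H1 @ e, 0.0)              # hidden activations
W_naive = (H2_naive @ h).reshape(fan_out_main, fan_in_main)

# Target: the generated mainnet weights should themselves look He-initialized
# for the *mainnet* layer, i.e. Var(W) close to 2 / fan_in_main.
target_var = 2.0 / fan_in_main
print(f"target mainnet variance: {target_var:.2e}")
print(f"naive hypernet init:     Var(W) = {W_naive.var():.2e}")  # ~fan_in_main times too large

# Rescaled output layer: choose Var(H2) so that, given the hidden activations,
# Var(W_ij) = Var(H2) * sum_k h_k^2 hits the mainnet target.
H2_scaled = rng.normal(0.0, np.sqrt(target_var / np.sum(h**2)),
                       size=(n_weights, d_hidden))
W_scaled = (H2_scaled @ h).reshape(fan_out_main, fan_in_main)
print(f"rescaled hypernet init:  Var(W) = {W_scaled.var():.2e}")  # close to the target
```

In this toy setup the naive scheme overshoots the target variance by roughly a factor of the mainnet fan-in; that scale mismatch is the kind of problem the paper's initialization techniques are designed to remove.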

References (37)
  1. Hypernetwork knowledge graph embeddings. arXiv preprint arXiv:1808.07018, 2018.
  2. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pp. 153–160, 2007.
  3. Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
  4. A generative model for sampling high-performance and diverse weights for neural networks. arXiv preprint arXiv:1905.02898, 2019.
  5. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.
  6. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
  7. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
  8. Approximating the predictive distribution via adversarially-trained hypernetworks. In Bayesian Deep Learning Workshop, NeurIPS (Spotlight), volume 2018, 2018.
  9. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
  10. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  11. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  12. Hypernetwork functional image representation. arXiv preprint arXiv:1902.10404, 2019.
  13. Evolving neural networks in compressed weight space. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 619–626. ACM, 2010.
  14. Predictive uncertainty quantification with compound density networks. arXiv preprint arXiv:1902.01080, 2019.
  15. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
  16. Computing higher order derivatives of matrix and tensor expressions. In Advances in Neural Information Processing Systems, pp. 2755–2764, 2018.
  17. Metapruning: Meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258, 2019.
  18. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.
  19. Modular universal reparameterization: Deep multi-task learning across diverse domains. arXiv preprint arXiv:1906.00097, 2019.
  20. Hyperst-net: Hypernetworks for spatio-temporal forecasting. arXiv preprint arXiv:1809.10889, 2018.
  21. Implicit weight uncertainty in neural networks. arXiv preprint arXiv:1711.01297, 2017.
  22. Neale Ratzlaff and Li Fuxin. Hypergan: A generative model for diverse, performant neural networks. arXiv preprint arXiv:1901.11058, 2019.
  23. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.
  24. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
  25. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
  26. Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion. arXiv preprint arXiv:1906.00794, 2019.
  27. Meta networks for neural style transfer. arXiv preprint arXiv:1709.04111, 2017.
  28. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  29. A hypercube-based encoding for evolving large-scale neural networks. Artificial life, 15(2):185–212, 2009.
  30. Joseph Suarez. Language modeling with recurrent highway hypernetworks. In Advances in neural information processing systems, pp. 3267–3276, 2017.
  31. Hypernetworks with statistical filtering for defending adversarial examples. arXiv preprint arXiv:1711.01791, 2017.
  32. Hypernetwork-based implicit posterior estimation and model averaging of cnn. In Asian Conference on Machine Learning, pp. 176–191, 2018.
  33. Continual learning with hypernetworks. arXiv preprint arXiv:1906.00695, 2019.
  34. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158, 2017.
  35. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
  36. Graph hypernetworks for neural architecture search. arXiv preprint arXiv:1810.05749, 2018.
  37. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.