Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training (2312.00359v1)

Published 1 Dec 2023 in cs.LG and stat.ML

Abstract: Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
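
The abstract describes the mechanism only at a high level, so the sketch below illustrates the general idea of HT-SR-guided, layer-wise learning rates in PyTorch: estimate a heavy-tail exponent from each layer's weight eigenspectrum, then map that exponent to a per-layer learning rate. The tail-fraction choice, the Hill-style exponent estimate, and the linear exponent-to-rate mapping are illustrative assumptions, not the exact TempBalance metric or schedule from the paper.

```python
import torch
import torch.nn as nn

def esd_tail_exponent(weight: torch.Tensor, tail_frac: float = 0.5) -> float:
    """Rough heavy-tail exponent of a layer's empirical spectral density,
    via a simple Hill-style estimate on the largest eigenvalues of W^T W.
    This is a stand-in for the more careful power-law fits used in HT-SR
    work, not the paper's exact metric."""
    W = weight.detach().flatten(1).float()        # treat conv kernels as matrices
    eigs = torch.linalg.svdvals(W) ** 2           # eigenvalues of W W^T
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(tail_frac * eigs.numel()))     # tail size (illustrative choice)
    tail = eigs[:k]
    lam_min = tail[-1].clamp_min(1e-12)
    log_sum = torch.log(tail / lam_min).sum().clamp_min(1e-12).item()
    return 1.0 + k / log_sum                      # alpha ~ 1 + k / sum(log(lam_i/lam_min))

def layerwise_lr_groups(model: nn.Module, base_lr: float, spread: float = 0.5):
    """Assign each Linear/Conv2d layer a learning rate based on its tail
    exponent: layers with larger exponents (less heavy-tailed spectra) get
    larger rates, smaller exponents get smaller ones. The linear mapping
    and `spread` are assumptions for illustration; BatchNorm and other
    parameter-bearing modules are omitted here for brevity."""
    layers = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    alphas = [esd_tail_exponent(m.weight) for m in layers]
    lo, hi = min(alphas), max(alphas)
    groups = []
    for m, a in zip(layers, alphas):
        t = 0.5 if hi == lo else (a - lo) / (hi - lo)      # rank within [0, 1]
        lr = base_lr * (1.0 - spread + 2.0 * spread * t)   # in [base*(1-s), base*(1+s)]
        groups.append({"params": m.parameters(), "lr": lr})
    return groups

# Usage sketch:
# model = torchvision.models.resnet18(num_classes=10)
# opt = torch.optim.SGD(layerwise_lr_groups(model, base_lr=0.1), momentum=0.9)
```

In practice these per-layer rates would be refreshed periodically during training (for example, once per epoch) and combined with the usual global decay schedule; the snippet shows only a single static assignment.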

