Improve Generalization Ability of Deep Wide Residual Network with A Suitable Scaling Factor (2403.04545v1)

Published 7 Mar 2024 in cs.LG, math.ST, and stat.TH

Abstract: Deep Residual Neural Networks (ResNets) have demonstrated remarkable success across a wide range of real-world applications. In this paper, we identify a suitable scaling factor (denoted by $\alpha$) on the residual branch of deep wide ResNets to achieve good generalization ability. We show that if $\alpha$ is a constant, the class of functions induced by Residual Neural Tangent Kernel (RNTK) is asymptotically not learnable, as the depth goes to infinity. We also highlight a surprising phenomenon: even if we allow $\alpha$ to decrease with increasing depth $L$, the degeneration phenomenon may still occur. However, when $\alpha$ decreases rapidly with $L$, the kernel regression with deep RNTK with early stopping can achieve the minimax rate provided that the target regression function falls in the reproducing kernel Hilbert space associated with the infinite-depth RNTK. Our simulation studies on synthetic data and real classification tasks such as MNIST, CIFAR10 and CIFAR100 support our theoretical criteria for choosing $\alpha$.
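
The abstract's prescription maps directly onto how the residual update is written: each block computes $x_{l+1} = x_l + \alpha\, f_l(x_l)$, with $\alpha$ shrinking quickly as the total depth $L$ grows. The sketch below is illustrative only and is not the authors' code; the choice $\alpha = L^{-\gamma}$ and the names `ScaledResidualBlock` and `build_scaled_resnet` are assumptions made for this example.

```python
# Minimal sketch (assumed, not the paper's implementation): a wide residual
# block whose residual branch is scaled by alpha, with alpha chosen to
# decrease with the total depth L (here alpha = L**(-gamma), a hypothetical
# schedule consistent with the abstract's "alpha decreases rapidly with L").
import torch
import torch.nn as nn


class ScaledResidualBlock(nn.Module):
    def __init__(self, width: int, alpha: float):
        super().__init__()
        self.alpha = alpha  # scaling factor on the residual branch
        self.branch = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x_{l+1} = x_l + alpha * f_l(x_l)
        return x + self.alpha * self.branch(x)


def build_scaled_resnet(width: int = 512, depth: int = 32,
                        gamma: float = 1.0) -> nn.Sequential:
    # alpha shrinks with depth L; gamma = 0 would recover the constant-alpha
    # regime that the paper identifies as degenerate when L grows.
    alpha = depth ** (-gamma)
    blocks = [ScaledResidualBlock(width, alpha) for _ in range(depth)]
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    net = build_scaled_resnet()
    x = torch.randn(8, 512)
    print(net(x).shape)  # torch.Size([8, 512])
```

Setting `gamma = 0` reproduces the constant-$\alpha$ case the paper shows is asymptotically not learnable, while a sufficiently fast decay of $\alpha$ with $L$ is the regime in which early-stopped RNTK regression attains the minimax rate.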

