Global Convergence Rate of Deep Equilibrium Models with General Activations (2302.05797v3)
Published 11 Feb 2023 in stat.ML and cs.LG
Abstract: In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation and proved that gradient descent converges to a globally optimal solution of the quadratic loss at a linear convergence rate. This paper shows that the same result still holds for DEQs with any general activation that is bounded and has bounded first and second derivatives. Because such activations are generally non-homogeneous, lower-bounding the least eigenvalue of the Gram matrix at the equilibrium point is particularly challenging. To accomplish this, we construct a novel population Gram matrix and develop a new form of dual activation based on a Hermite polynomial expansion.
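The central object in this analysis is the Gram matrix formed by the DEQ's equilibrium (fixed-point) features, whose least eigenvalue must be bounded away from zero. The following is a minimal numerical sketch, not the paper's construction: it assumes a generic single-layer DEQ of the form z* = σ(W z* + U x), with σ = tanh standing in for a bounded, smooth activation, solves for the equilibrium by fixed-point iteration, and inspects the least eigenvalue of the resulting empirical Gram matrix. All dimensions, scalings, and the activation choice here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's regime):
n, d, m = 8, 5, 256          # samples, input dimension, width

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs

U = rng.standard_normal((m, d)) / np.sqrt(d)
W = rng.standard_normal((m, m))
W *= 0.4 / np.linalg.norm(W, 2)                 # spectral norm 0.4 < 1, so the
                                                # fixed-point map is a contraction
                                                # (since |tanh'| <= 1)

def equilibrium(x, n_iter=100):
    """Solve z = tanh(W z + U x) by fixed-point iteration."""
    z = np.zeros(m)
    for _ in range(n_iter):
        z = np.tanh(W @ z + U @ x)
    return z

# Empirical Gram matrix of the equilibrium features; the convergence analysis
# hinges on lower-bounding its least eigenvalue.
Z = np.stack([equilibrium(x) for x in X])       # shape (n, m)
G = Z @ Z.T / m
print("least eigenvalue of the equilibrium Gram matrix:",
      np.linalg.eigvalsh(G).min())
```

In the over-parametrized regime studied in the paper, this least eigenvalue is controlled through a population Gram matrix and the Hermite expansion of the activation's dual, rather than by direct numerical inspection as above.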
- On exact computation with an infinitely wide neural net. In Neural Information Processing Systems, 2019.
- Deep equilibrium models. arXiv:1909.01377, 2019.
- Multiscale deep equilibrium models. arXiv:2006.08656, 2020.
- Generalization Performance of Support Vector Machines and Other Pattern Classifiers, pp. 43–54. MIT Press, 1999.
- Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
- F. Biggs and B. Guedj. Differentiable PAC-Bayes objectives with partially aggregated neural networks. Entropy, 23, 2021.
- From average case complexity to improper learning complexity. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (STOC), 2014.
- Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
- Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), 2019.
- Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence (UAI), 2017.
- Generalization error bounds via Rényi-, f-divergences and maximal leakage. IEEE Transactions on Information Theory, 67(8):4986–5004, 2021.
- Gradient descent optimizes infinite-depth ReLU implicit networks with linear widths. arXiv:2205.07463, 2022.
- A global convergence theory for deep ReLU implicit networks via over-parameterization. International Conference on Learning Representations (ICLR), 2022.
- Wide neural networks as Gaussian processes: Lessons from deep equilibrium models. Conference on Neural Information Processing Systems (NeurIPS), 2023.
- Deep equilibrium architectures for inverse problems in imaging. IEEE Transactions on Computational Imaging, 7:1123–1133, 2021.
- Matrix Analysis. Cambridge University Press, 1985.
- Generalization error in deep learning. arXiv:1808.01174, 2018.
- Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 1994.
- V. Koltchinskii and D. Panchenko. Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers. The Annals of Statistics, 30(1):1–50, 2002.
- ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60:84–90, 2012.
- Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv:1808.01204, 2018.
- Global convergence of over-parameterized deep equilibrium models. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
- Quynh N. Nguyen. On the proof of global convergence of gradient descent for deep ReLU networks with linear widths. arXiv:2101.09612, 2021.
- G. Szegő. Orthogonal Polynomials. American Mathematical Society, 1959.
- T. Tao. Topics in Random Matrix Theory. American Mathematical Society, 2012.
- Lan V. Truong. Generalization error bounds on deep learning with Markov datasets. Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS), 2022a.
- Lan V. Truong. On Rademacher complexity-based generalization bounds for deep learning. arXiv:2208.04284, 2022b.
- V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
- Optimization induced equilibrium networks: An explicit optimization perspective for understanding equilibrium models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:3604–3616, 2022.
- A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems (NIPS), 2017.