Transgressing the boundaries: towards a rigorous understanding of deep learning and its (non-)robustness (2307.02454v1)
Abstract: The recent advances in machine learning across various fields of application can be largely attributed to the rise of deep learning (DL) methods and architectures. Despite being a key technology behind autonomous cars, image processing, speech recognition and more, DL still suffers from a notorious lack of theoretical understanding and from related interpretability and (adversarial) robustness issues. Understanding what distinguishes DL from, say, other forms of nonlinear regression or statistical learning is interesting from a mathematical perspective, but it is also of crucial importance in practice: treating neural networks as mere black boxes may be sufficient in some cases, but many applications require watertight performance guarantees and a deeper understanding of what can go wrong and why. It is probably fair to say that, despite being mathematically well founded as a method for approximating complicated functions, DL still resembles modern alchemy, firmly in the hands of engineers and computer scientists. Nevertheless, the specifics of DL that could explain its success in applications clearly demand systematic mathematical treatment. In this work, we review robustness issues of DL and, in particular, bridge concerns and approaches from approximation theory and statistical learning theory. Further, we review Bayesian deep learning as a means for uncertainty quantification and rigorous explainability.
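To make the (non-)robustness issue mentioned in the abstract concrete, the sketch below illustrates the classic Fast Gradient Sign Method of Goodfellow et al. (2014): a tiny, norm-bounded perturbation of the input constructed from the loss gradient, which is the prototypical adversarial attack. This is only an illustrative example, not the method of the paper; the untrained linear classifier and the random "image" are hypothetical placeholders for any trained network and real data.

```python
# Minimal FGSM sketch (illustrative only): perturb an input by a small step in the
# direction that increases the loss, a standard way to probe adversarial robustness.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical classifier standing in for any trained network.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(1, 1, 28, 28)   # placeholder "image" in [0, 1]
y = torch.tensor([3])          # placeholder label
epsilon = 0.1                  # L-infinity perturbation budget

x.requires_grad_(True)
loss = loss_fn(model(x), y)
loss.backward()

# FGSM step: move every pixel by epsilon in the sign of the loss gradient.
x_adv = (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

print("prediction on clean input:    ", model(x).argmax(dim=1).item())
print("prediction on perturbed input:", model(x_adv).argmax(dim=1).item())
```

For a trained network, the perturbed input is typically indistinguishable from the original to a human observer, yet can flip the predicted class, which is exactly the fragility that motivates the robustness analysis reviewed in the paper.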
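The abstract also points to Bayesian deep learning for uncertainty quantification. A minimal and widely used approximation (an assumption here, not the paper's specific method) is Monte Carlo dropout in the spirit of Gal and Ghahramani (2016): keeping dropout active at prediction time and treating the spread of repeated forward passes as a proxy for predictive uncertainty. The architecture and input below are again hypothetical placeholders.

```python
# Minimal MC-dropout sketch (illustrative only): estimate predictive uncertainty by
# sampling several stochastic forward passes with dropout switched on.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(),
    nn.Dropout(p=0.5), nn.Linear(64, 10),
)
model.train()  # keep dropout active so repeated forward passes differ

x = torch.rand(1, 1, 28, 28)  # placeholder input
with torch.no_grad():
    samples = torch.stack([model(x).softmax(dim=1) for _ in range(100)])

mean = samples.mean(dim=0)  # approximate predictive distribution
std = samples.std(dim=0)    # per-class spread as a simple uncertainty proxy
print("predicted class:", mean.argmax().item(), "max class std:", std.max().item())
```

Inputs that are far from the training data, including adversarially perturbed ones, should ideally come with a large predictive spread, which is one motivation for studying Bayesian neural networks in the context of robustness.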
Authors: Carsten Hartmann, Lorenz Richter