Accelerated Neural Network Training with Rooted Logistic Objectives (2310.03890v1)
Abstract: Many neural networks deployed in real-world scenarios are trained with cross-entropy-based loss functions. From an optimization perspective, it is known that the behavior of first-order methods such as gradient descent depends crucially on the separability of the dataset. In fact, even in the simplest case of binary classification, the rate of convergence depends on two factors: (1) the condition number of the data matrix, and (2) the separability of the dataset. Absent further pre-processing techniques such as over-parametrization or data augmentation, separability is an intrinsic property of the data distribution under consideration. We focus on the landscape design of the logistic function and derive a novel sequence of {\em strictly} convex functions that are at least as strictly convex as the logistic loss. The minimizers of these functions coincide with the minimum-norm solution whenever one exists. The strict convexity of the derived functions carries over to fine-tuning state-of-the-art models and applications. In our empirical analysis, we apply the proposed rooted logistic objective to multiple deep models, including fully connected neural networks and transformers, on a variety of classification benchmarks. Our results show that training with the rooted loss function converges faster and yields performance improvements. Furthermore, we demonstrate the rooted loss in generative-modeling downstream applications, such as fine-tuning a StyleGAN model with the rooted loss. Code implementing our losses and models is available for open-source development at: https://anonymous.4open.science/r/rooted_loss
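The abstract does not spell out the functional form of the rooted objective, so the snippet below is only a minimal PyTorch sketch of how a drop-in replacement for the standard binary logistic loss could be wired into a training step. The `rooted_logistic_loss` body shown here is a hypothetical k-th-root variant chosen for illustration (it recovers the usual -log sigmoid loss only in the limit k → ∞); it is not the paper's derivation, and all names and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn


def rooted_logistic_loss(logits: torch.Tensor,
                         targets: torch.Tensor,
                         k: float = 2.0) -> torch.Tensor:
    """Hypothetical k-th-root variant of the binary logistic loss.

    The standard logistic loss is -log(sigmoid(y * z)) for signed labels y.
    Here the correct-class probability is raised to 1/k before entering the
    loss; as k -> infinity, k * (1 - p**(1/k)) tends to -log(p), the usual
    objective. This exact form is illustrative only and may differ from the
    rooted objective derived in the paper.
    """
    y = 2.0 * targets.float() - 1.0               # labels {0, 1} -> {-1, +1}
    p = torch.sigmoid(y * logits.squeeze(-1))     # probability of the true label
    return (k * (1.0 - p.pow(1.0 / k))).mean()    # scalar loss over the batch


# Usage: swap the loss into an otherwise unchanged training step.
model = nn.Linear(20, 1)                          # toy binary classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(64, 20)                           # toy batch of features
targets = torch.randint(0, 2, (64,))              # toy binary labels

optimizer.zero_grad()
logits = model(x)
loss = rooted_logistic_loss(logits, targets, k=4.0)
loss.backward()
optimizer.step()
```

The intent of the sketch is simply that the rooted objective is a stand-alone loss function, so adopting it requires changing only the loss call in an existing training loop, not the model or optimizer.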