One-Bit Quantization and Sparsification for Multiclass Linear Classification with Strong Regularization (2402.10474v2)
Abstract: We study the use of linear regression for multiclass classification in the over-parametrized regime where some of the training data is mislabeled. In such scenarios it is necessary to add an explicit regularization term, $\lambda f(w)$, for some convex function $f(\cdot)$, to avoid overfitting the mislabeled data. In our analysis, we assume that the data is sampled from a Gaussian Mixture Model with equal class sizes, and that a proportion $c$ of the training labels is corrupted for each class. Under these assumptions, we prove that the best classification performance is achieved when $f(\cdot) = \|\cdot\|_2^2$ and $\lambda \to \infty$. We then proceed to analyze the classification errors for $f(\cdot) = \|\cdot\|_1$ and $f(\cdot) = \|\cdot\|_\infty$ in the large-$\lambda$ regime and notice that it is often possible to find sparse and one-bit solutions, respectively, that perform almost as well as the one corresponding to $f(\cdot) = \|\cdot\|_2^2$.
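To make the setup concrete, the sketch below builds a small synthetic instance of the problem: Gaussian Mixture data with equal class sizes, a proportion $c$ of corrupted labels per class, and one-hot least-squares fits under the three penalties $\|\cdot\|_2^2$, $\|\cdot\|_1$, and $\|\cdot\|_\infty$. This is only an illustration, not the paper's asymptotic analysis: the dimensions, the corruption level, the value of $\lambda$, and the use of CVXPY are assumptions made for the example. (Note that the $\lambda \to \infty$ limit for the ridge penalty is well defined in direction: the arg-max decision rule is scale invariant, and $(X^\top X + \lambda I)^{-1} X^\top Y \approx \lambda^{-1} X^\top Y$ for large $\lambda$, so the limiting classifier is the one built from $X^\top Y$.)

```python
# Minimal illustrative sketch (not the paper's exact experiment): multiclass
# linear regression on a Gaussian Mixture Model with corrupted labels,
# comparing squared-l2, l1, and l-infinity regularization.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, k = 150, 300, 3   # over-parametrized regime: d > n
c = 0.2                 # proportion of corrupted labels per class
lam = 100.0             # illustrative; for the (non-squared) l1 / l-inf penalties,
                        # a large enough lambda drives W exactly to zero,
                        # so we keep lambda moderate here

# Gaussian Mixture Model with equal class sizes and orthogonal class means.
mu = 3.0 * np.eye(k, d)                       # one mean per class, in R^d
y_true = np.repeat(np.arange(k), n // k)      # equal class sizes
X = mu[y_true] + rng.standard_normal((n, d))  # x = mu_y + standard Gaussian noise

# Corrupt a proportion c of the labels in each class (flip to a random other class).
y = y_true.copy()
for cls in range(k):
    idx = np.flatnonzero(y_true == cls)
    flip = rng.choice(idx, size=int(c * len(idx)), replace=False)
    y[flip] = (y[flip] + rng.integers(1, k, size=len(flip))) % k
Y = np.eye(k)[y]                              # one-hot regression targets

def fit(penalty):
    """Solve min_W ||Y - X W||_F^2 + lam * f(W) for the chosen penalty f."""
    W = cp.Variable((d, k))
    f = {"l2_sq": cp.sum_squares(W),           # squared l2 (ridge)
         "l1":    cp.sum(cp.abs(W)),           # entrywise l1  -> sparse W
         "linf":  cp.max(cp.abs(W))}[penalty]  # entrywise l-inf -> "one-bit" W
    cp.Problem(cp.Minimize(cp.sum_squares(Y - X @ W) + lam * f)).solve()
    return W.value

def test_error(W, n_test=2000):
    yt = rng.integers(0, k, size=n_test)
    Xt = mu[yt] + rng.standard_normal((n_test, d))
    return np.mean(np.argmax(Xt @ W, axis=1) != yt)

for pen in ("l2_sq", "l1", "linf"):
    W = fit(pen)
    nonzero = np.mean(np.abs(W) > 1e-6 * np.abs(W).max())    # sparsity diagnostic
    near_max = np.mean(np.abs(W) > 0.99 * np.abs(W).max())   # "one-bit" diagnostic:
    # the near-max fraction grows toward 1 for the l-inf penalty as lambda increases
    print(f"{pen:>5}: test error {test_error(W):.3f} | "
          f"nonzero frac {nonzero:.2f} | near-max frac {near_max:.2f}")
```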