One-Bit Quantization and Sparsification for Multiclass Linear Classification with Strong Regularization (2402.10474v2)

Published 16 Feb 2024 in cs.LG and stat.ML

Abstract: We study the use of linear regression for multiclass classification in the over-parametrized regime where some of the training data is mislabeled. In such scenarios it is necessary to add an explicit regularization term, $\lambda f(w)$, for some convex function $f(\cdot)$, to avoid overfitting the mislabeled data. In our analysis, we assume that the data is sampled from a Gaussian Mixture Model with equal class sizes, and that a proportion $c$ of the training labels is corrupted for each class. Under these assumptions, we prove that the best classification performance is achieved when $f(\cdot) = \|\cdot\|_2^2$ and $\lambda \to \infty$. We then proceed to analyze the classification errors for $f(\cdot) = \|\cdot\|_1$ and $f(\cdot) = \|\cdot\|_\infty$ in the large $\lambda$ regime and notice that it is often possible to find sparse and one-bit solutions, respectively, that perform almost as well as the one corresponding to $f(\cdot) = \|\cdot\|_2^2$.
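The setup described in the abstract is compact enough to sketch numerically. Below is a minimal illustration, not the authors' code: it draws data from a Gaussian Mixture Model with equal class sizes, corrupts a proportion $c$ of labels per class, and fits a weight matrix $W$ by least squares on one-hot labels plus $\lambda f(W)$ for each of the three regularizers, using CVXPY. All parameter values, helper names, and the entrywise reading of the norms on $W$ are assumptions made for this sketch.

```python
# Sketch of the abstract's setup (assumptions throughout, not the paper's code).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, k = 150, 300, 3          # over-parametrized regime: d > n (illustrative sizes)
c, lam = 0.1, 1e3              # label-corruption rate c, large regularization weight

# Gaussian Mixture Model with equal class sizes: x = mu_y + standard Gaussian noise
mus = rng.standard_normal((k, d))
y = np.repeat(np.arange(k), n // k)
X = mus[y] + rng.standard_normal((n, d))

# Corrupt a proportion c of the training labels within each class
y_train = y.copy()
for cls in range(k):
    idx = np.flatnonzero(y == cls)
    flip = rng.choice(idx, size=int(c * len(idx)), replace=False)
    y_train[flip] = (y_train[flip] + rng.integers(1, k, size=len(flip))) % k

Y = np.eye(k)[y_train]         # one-hot (noisy) training labels

def fit(reg):
    """Solve min_W ||X W - Y||_F^2 + lam * f(W) with CVXPY."""
    W = cp.Variable((d, k))
    cp.Problem(cp.Minimize(cp.sum_squares(X @ W - Y) + lam * reg(W))).solve()
    return W.value

regularizers = {
    "||.||_2^2": cp.sum_squares,                         # squared l2 (entrywise)
    "||.||_1":   lambda W: cp.norm(W.flatten(), 1),      # entrywise l1 -> sparse W
    "||.||_inf": lambda W: cp.norm(W.flatten(), "inf"),  # entrywise l-inf -> near one-bit W
}
for name, f in regularizers.items():
    W = fit(f)
    acc = np.mean(np.argmax(X @ W, axis=1) == y)         # score against clean labels
    print(f"f = {name}: accuracy vs clean labels = {acc:.2f}")
```

Note that large $\lambda$ only shrinks the scale of $W$, and the argmax classification rule is scale-invariant, which is consistent with the abstract's claim that performance is attained as $\lambda \to \infty$. The $\ell_1$ and $\ell_\infty$ penalties tend to produce sparse and equal-magnitude (hence one-bit-like) weights, respectively, which is the phenomenon the paper quantifies.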
