A Provably Accurate Randomized Sampling Algorithm for Logistic Regression (2402.16326v3)

Published 26 Feb 2024 in stat.ML, cs.DS, and cs.LG

Abstract: In statistics and machine learning, logistic regression is a widely used supervised learning technique primarily employed for binary classification tasks. For the setting where the number of observations greatly exceeds the number of predictor variables, we present a simple randomized sampling-based algorithm for the logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of the estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.
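
To make the recipe in the abstract concrete, here is a minimal Python sketch of leverage-score subsampling for logistic regression. It is only an illustration of the general idea, not the paper's exact algorithm: the synthetic data, the sample size `s`, and the 1/(s·p_i) importance reweighting of the sampled fit are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data with n >> d, matching the regime the abstract targets.
n, d = 100_000, 20
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Leverage scores of X: squared row norms of Q from a thin QR
# (equivalently, of the left singular vectors of X). Costs O(n d^2).
Q, _ = np.linalg.qr(X)
lev = np.sum(Q**2, axis=1)        # leverage scores; they sum to d
probs = lev / lev.sum()           # sampling distribution over rows

# Draw s << n observations with replacement, proportional to leverage.
s = 2_000
idx = rng.choice(n, size=s, replace=True, p=probs)

# Fit the full model and the subsampled model. C=1e10 approximates an
# unregularized maximum-likelihood fit; the 1/(s * p_i) sample weights
# make the subsampled log-likelihood an unbiased estimate of the full one
# (an illustrative choice, not a detail taken from the paper).
full = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
sub = LogisticRegression(C=1e10, max_iter=1000).fit(
    X[idx], y[idx], sample_weight=1.0 / (s * probs[idx])
)

# Compare estimated probabilities across all n observations.
p_full = full.predict_proba(X)[:, 1]
p_sub = sub.predict_proba(X)[:, 1]
print("max |p_full - p_sub| =", np.abs(p_full - p_sub).max())
```

Running this, the subsampled estimates typically track the full-data probabilities closely even though only s = 2,000 of the 100,000 observations were used, which is the qualitative behavior the paper's theory predicts for leverage-score sampling.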

