
Majority-of-Three: The Simplest Optimal Learner? (2403.08831v1)

Published 12 Mar 2024 in stat.ML, cs.LG, math.ST, and stat.TH

Abstract: Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke's algorithm is quite complex as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm's error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime.

References (22)
  1. The one-inclusion graph algorithm is not always optimal. In The Thirty Sixth Annual Conference on Learning Theory, COLT 2023, volume 195 of Proceedings of Machine Learning Research, pages 72–88. PMLR, 2023.
  2. Optimal PAC bounds without uniform convergence. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1203–1223. IEEE Computer Society, 2023.
  3. A new PAC bound for intersection-closed concept classes. Machine Learning, 66(2):151–163, 2007.
  4. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
  5. Proper learning, Helly number, and an optimal SVM bound. In Conference on Learning Theory, pages 582–609. PMLR, 2020.
  6. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
  7. Malte Darnstädt. The optimal PAC bound for intersection-closed concept classes. Information Processing Letters, 115(4):458–461, 2015.
  8. Steve Hanneke. Theoretical Foundations of Active Learning. Doctoral thesis, Carnegie-Mellon University, Machine Learning Department, 2009.
  9. Steve Hanneke. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1):1319–1333, 2016.
  10. Steve Hanneke. Refined error bounds for several learning algorithms. The Journal of Machine Learning Research, 17(1):4667–4721, 2016.
  11. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.
  12. Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics and Probability Letters, 135:1–6, 2018.
  13. Kasper Green Larsen. Bagging is an optimal PAC learner. In The Thirty Sixth Annual Conference on Learning Theory, COLT 2023, volume 195 of Proceedings of Machine Learning Research, pages 450–468. PMLR, 2023.
  14. The one-inclusion graph algorithm is near-optimal for the prediction model of learning. IEEE Transactions on Information Theory, 47(3):1257–1261, 2001.
  15. Robert E. Schapire. The Design and Analysis of Efficient Learning Algorithms. ACM Doctoral Dissertation Awards. The MIT Press, 1992.
  16. Hans U Simon. An almost optimal PAC algorithm. In Conference on Learning Theory, pages 1552–1563. PMLR, 2015.
  17. Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  18. A class of algorithms for pattern recognition learning. Avtomatika i Telemekhanika, 25(6):937–945, 1964.
  19. Algorithms with complete memory and recurrent algorithms in the problem of learning pattern recognition. Avtomatika i Telemekhanika, pages 95–106, 1968.
  20. On uniform convergence of the frequencies of events to their probabilities. Teoriya Veroyatnostei i ee Primeneniya, 16(2):264–279, 1971.
  21. Theory of Pattern Recognition. Nauka, Moscow, 1974.
  22. Manfred K Warmuth. The optimal PAC algorithm. In International Conference on Computational Learning Theory, pages 641–642. Springer, 2004.

Summary

  • The paper proves that the Majority-of-Three learner achieves an in-expectation error bound of $O(d/n)$, matching known optimal predictors in the PAC framework.
  • The analysis establishes a high-probability bound of $O\big((d/n)\log\log(\min\{n/d,1/\delta\}) + (1/n)\log(1/\delta)\big)$, with only a slight sub-optimality due to a $\log\log$ factor.
  • The results highlight that simple ensemble methods can nearly attain optimal generalization, spurring future research on tighter analyses and broader learning settings.

High-Probability and Expected Generalization Bounds for Majority-of-Three Learners

Introduction

This paper studies the effectiveness of an extremely simple yet surprisingly powerful learning algorithm within the framework of Probably Approximately Correct (PAC) learning. The algorithm, dubbed Majority-of-Three, returns the majority vote of three Empirical Risk Minimization (ERM) predictors, each trained on a disjoint subset of the training data. The paper's main contributions are in-expectation and high-probability upper bounds on the generalization error of the Majority-of-Three learner, sharpening our understanding of the simplest forms of optimal learners in the realizable PAC setting.
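To make the procedure concrete, here is a minimal, self-contained sketch (not the paper's implementation) for the toy class of one-dimensional threshold functions. The names `erm_threshold` and `majority_of_three` and the random three-way split are illustrative assumptions; the brute-force ERM merely stands in for whatever consistent learner is available for the class at hand.

```python
import numpy as np

def erm_threshold(X, y):
    """Brute-force ERM for 1-D threshold classifiers h_t(x) = 1[x >= t]
    (VC dimension 1); a stand-in for any ERM over the class of interest."""
    candidates = np.concatenate(([-np.inf], np.sort(X)))
    errors = [np.mean((X >= t).astype(int) != y) for t in candidates]
    t_best = candidates[int(np.argmin(errors))]
    return lambda X_new: (np.asarray(X_new) >= t_best).astype(int)

def majority_of_three(X, y, erm=erm_threshold, seed=0):
    """Majority-of-Three: split the sample into three disjoint folds,
    run ERM on each fold, and predict by majority vote."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 3)
    hypotheses = [erm(X[idx], y[idx]) for idx in folds]

    def predict(X_new):
        votes = np.stack([h(X_new) for h in hypotheses])  # shape (3, m)
        return (votes.sum(axis=0) >= 2).astype(int)       # majority of three
    return predict

# Toy usage on realizable data with target f*(x) = 1[x >= 0.5].
rng = np.random.default_rng(1)
X_train = rng.uniform(size=300)
y_train = (X_train >= 0.5).astype(int)
h = majority_of_three(X_train, y_train)
X_test = rng.uniform(size=10_000)
print("test error:", np.mean(h(X_test) != (X_test >= 0.5).astype(int)))
```

Any other ERM implementation can be substituted for `erm_threshold` without changing the voting step; only the three-way disjoint split and the majority vote are essential to the scheme.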

Expected Generalization Bounds for Majority-of-Three

The first key result of the paper shows that Majority-of-Three achieves an optimal in-expectation generalization bound under the realizable PAC setting. Specifically, it is proven that:

  • Theorem 1: For a function class $\mathcal{F}$ with VC dimension $d$, distribution $P$, and target function $f^\star \in \mathcal{F}$, the in-expectation error of the Majority-of-Three learner is bounded above by $O(d/n)$, where $n$ is the training sample size.

This result is significant as it demonstrates that the generalization error of Majority-of-Three, in expectation, matches that of the one-inclusion graph predictor, which is known to be optimal in this metric. The analysis builds on the notion of partitioning the input space into regions based on the probability of a single ERM learner erring on each point, and subsequently using a series of careful probabilistic arguments to bound the expected error over these regions.
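The following back-of-the-envelope calculation is not the paper's proof, but it illustrates why voting helps. Fix a point $x$ and let $p(x)$ denote the probability that a single ERM trained on one fold errs at $x$; since the three folds are disjoint parts of an i.i.d. sample, the three error events at $x$ are independent, and the majority vote errs only if at least two of them occur:

```latex
% Heuristic pointwise bound for Majority-of-Three at a fixed point x,
% assuming the three folds yield independent error events of probability p(x):
\[
  \Pr\bigl[\mathrm{Maj}(h_1,h_2,h_3)(x) \neq f^\star(x)\bigr]
  \;=\; 3\,p(x)^2\bigl(1-p(x)\bigr) + p(x)^3
  \;\le\; 3\,p(x)^2 .
\]
```

Points on which a single ERM is already reliable therefore contribute only quadratically to the expected error of the majority vote; the paper's actual analysis controls the regions where $p(x)$ is large via more careful probabilistic arguments.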

High-Probability Generalization Bounds for Majority-of-Three

The paper further extends the analysis of Majority-of-Three to the high-probability regime, providing a near-optimal high-probability upper bound on its generalization error. The established bound can be summarized as follows:

  • Theorem 2: With probability at least $1-\delta$ over the sampling of the training data, the generalization error of Majority-of-Three is bounded above by $O\big((d/n)\log\log(\min\{n/d,1/\delta\}) + (1/n)\log(1/\delta)\big)$.

Although the $\log\log$ factor makes this bound slightly sub-optimal compared to the known general lower bound for improper learners in the PAC model, it remains noteworthy because the bound is in fact optimal for a significant range of the parameter $\delta$, in particular in the small-$\delta$ regime.
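For orientation, Theorem 2 can be placed between two standard reference points for the realizable setting, stated here up to constant factors: the classical uniform-convergence bound for any consistent learner and the optimal rate attained by Hanneke's algorithm.

```latex
% Classical bound for any ERM / consistent learner (Blumer et al., 1989):
\[
  \operatorname{err}(\hat h_{\mathrm{ERM}})
    = O\!\left(\frac{d\log(n/d) + \log(1/\delta)}{n}\right).
\]
% Optimal rate in the realizable PAC setting (Hanneke, 2016):
\[
  \operatorname{err}(\hat h_{\mathrm{opt}})
    = \Theta\!\left(\frac{d + \log(1/\delta)}{n}\right).
\]
% Theorem 2 places Majority-of-Three between the two:
\[
  \operatorname{err}(\mathrm{Maj}_3)
    = O\!\left(\frac{d\log\log(\min\{n/d,\,1/\delta\}) + \log(1/\delta)}{n}\right).
\]
```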

Implications and Future Work

The analysis and results of this paper have broad implications for the theory of PAC learning, particularly for evaluating the complexity and effectiveness of learning algorithms. The optimality of Majority-of-Three in expectation and its near-optimality in the high-probability regime show that very simple aggregation methods can approach the theoretical limits of learnability in the realizable PAC setting.

Given that the high-probability bound for Majority-of-Three is sub-optimal by a $\log\log$ factor, a natural question is whether a tighter analysis could eliminate this gap, thereby establishing Majority-of-Three as an optimal learner in both the in-expectation and high-probability regimes. Analyzing the optimality of Majority-of-Three, or of similarly simple learning algorithms, under different models or assumptions (e.g., the agnostic setting or non-uniform learnability) is another interesting direction for future research.

Conclusion

In summary, this paper introduces and thoroughly analyzes a fundamental yet powerful learner within the PAC framework, highlighting the potential of simple majority schemes to achieve near-optimal generalization performance. The pursuit of simplicity, coupled with theoretical rigor, may pave the way toward understanding the essential properties that govern the efficiency and effectiveness of learning algorithms.