Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs (2306.16921v1)
Abstract: Experimental results have shown that curriculum learning, i.e., presenting simpler examples before more complex ones, can improve the efficiency of learning. Some recent theoretical results also showed that changing the sampling distribution can help neural networks learn parities, with formal results only for large learning rates and one-step arguments. Here we show a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution: if the data distribution is a mixture of sparse and dense inputs, there exists a regime in which a 2-layer ReLU neural network trained by a curriculum noisy-GD (or SGD) algorithm that uses sparse examples first can learn parities of sufficiently large degree, while any fully connected neural network of possibly larger width or depth trained by noisy-GD on the unordered samples cannot learn without additional steps. We also provide experimental results supporting the qualitative separation beyond the specific regime of the theoretical results.
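The setup described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's construction: "sparse" inputs are taken to be ±1 vectors whose coordinates are -1 with small probability, "dense" inputs are uniform over {-1,+1}^d, the target is a k-parity on a fixed subset of coordinates, and noisy SGD is approximated by adding Gaussian noise to the gradients. The dimensions, mixture weight, noise level, and phase lengths are placeholder values.

```python
# Hypothetical sketch of curriculum training on a sparse/dense input mixture
# with a parity target. Parameter values and the noisy-GD variant are
# illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

d, k = 30, 7                      # input dimension and parity degree (assumed)
S = torch.arange(k)               # support of the target parity

def sample_inputs(n, sparse_prob=0.1, mix=0.5):
    """Mixture of 'sparse' inputs (each coordinate is -1 with prob. sparse_prob)
    and 'dense' inputs (uniform over {-1,+1}^d)."""
    is_sparse = torch.rand(n) < mix
    p = torch.where(is_sparse[:, None],
                    torch.full((n, d), sparse_prob),
                    torch.full((n, d), 0.5))
    x = 1.0 - 2.0 * (torch.rand(n, d) < p).float()   # -1 with prob. p, else +1
    return x, is_sparse

def parity(x):
    return x[:, S].prod(dim=1)    # target: product of the coordinates in S

net = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)

def noisy_sgd_step(x, y, noise_std=1e-3):
    """One SGD step with Gaussian noise added to the gradients."""
    opt.zero_grad()
    loss = ((net(x).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for param in net.parameters():
            param.grad += noise_std * torch.randn_like(param.grad)
    opt.step()
    return loss.item()

# Curriculum: phase 1 trains on the sparse examples only, phase 2 on the mixture.
for step in range(2000):
    x, is_sparse = sample_inputs(256)
    if step < 1000:               # phase 1: keep only the sparse part of the batch
        x = x[is_sparse]
    noisy_sgd_step(x, parity(x))
```

The non-curriculum baseline the abstract contrasts against would simply skip the phase-1 filtering and train on the unordered mixture throughout.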