Generalization on the Unseen, Logic Reasoning and Degree Curriculum (2301.13105v3)

Published 30 Jan 2023 in cs.LG and stat.ML

Abstract: This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for sparse functions and a class of network models including instances of Transformers, random features models, and linear networks, a min-degree-interpolator is learned on the unseen. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher-degree basis elements. These findings lead to two implications: (1) we provide an explanation of the length generalization problem for Boolean functions (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports. Finally, we discuss extensions to other models or non-sparse regimes where the min-degree bias may still occur or fade, as well as how it can be potentially corrected when undesirable.
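
To make the min-degree-interpolator concrete, here is a minimal sketch in Python. It is our own illustration, not code from the paper or its exact algorithm: it interpolates Boolean training labels in the parity (Fourier) basis, admitting monomials in order of increasing degree and keeping the minimum-norm coefficients once exact interpolation becomes possible, so the Fourier mass lands on the lowest-degree basis elements that suffice. The toy target f(x) = x0*x1*x2, with the third coordinate frozen to +1 during training, is an assumed example; on the unseen half of the cube (x2 = -1) the low-degree interpolant x0*x1 gets every sign wrong, which is the failure mode the min-degree bias predicts.

```python
# A minimal sketch, not code from the paper: one way to realize a
# "min-degree interpolator" of Boolean training data in the parity
# (Fourier) basis, for small n where all monomials can be enumerated.
import itertools
import numpy as np


def parity_features(X, subsets):
    # chi_S(x) = prod_{i in S} x_i for x in {-1,+1}^n; the empty set gives the constant 1.
    cols = [np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X)) for S in subsets]
    return np.stack(cols, axis=1)


def min_degree_interpolator(X_train, y_train, n, tol=1e-8):
    # Admit monomials degree by degree until the training data is exactly
    # interpolated, then keep the minimum-norm coefficients, so all Fourier
    # mass sits on the lowest-degree basis elements that suffice.
    subsets = []
    for d in range(n + 1):
        subsets += list(itertools.combinations(range(n), d))
        Phi = parity_features(X_train, subsets)
        coef, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
        if np.max(np.abs(Phi @ coef - y_train)) < tol:
            break
    return subsets, coef


# Toy GOTU setup (an assumed example): target f(x) = x0*x1*x2 on {-1,+1}^3,
# training only on the "seen" half of the cube where x2 is frozen to +1.
n = 3
all_x = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
seen, unseen = all_x[all_x[:, 2] == 1], all_x[all_x[:, 2] == -1]
y_seen = seen[:, 0] * seen[:, 1] * seen[:, 2]

subsets, coef = min_degree_interpolator(seen, y_seen, n)
pred = parity_features(unseen, subsets) @ coef
print({S: round(float(c), 3) for S, c in zip(subsets, coef)})  # mass on (0, 1), not (0, 1, 2)
print("unseen prediction:", np.round(pred, 2))
print("unseen target:    ", unseen[:, 0] * unseen[:, 1] * unseen[:, 2])
```

The Degree-Curriculum introduced in the paper acts on the training side instead: it presents examples with small supports first and increments the support size, so that monomials are learned more efficiently; the sketch above only illustrates the interpolator the studied architectures tend to converge to.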

References (66)
  1. The staircase property: How hierarchical structure can guide deep learning, NeurIPS, 2021.
  2. Learning to reason with neural networks: Generalization, unseen data and boolean measures. arXiv preprint arXiv:2205.13647, 2022a.
  3. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks, COLT, 2022b.
  4. An initial alignment between neural network and target is needed for gradient descent to learn, 2022c. URL https://arxiv.org/abs/2202.12846.
  5. Revisiting neural scaling laws in language and vision. arXiv preprint arXiv:2209.06640, 2022.
  6. Exploring length generalization in large language models. arXiv preprint arXiv:2207.04901, 2022.
  7. Implicit regularization in deep matrix factorization, 2019. URL https://arxiv.org/abs/1905.13655.
  8. PHYRE: A new benchmark for physical reasoning. Advances in Neural Information Processing Systems, 32, 2019.
  9. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201, 2021.
  10. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006.
  11. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp.  41–48, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553380. URL https://doi.org/10.1145/1553374.1553380.
  12. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium, volume 267, 2019.
  13. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
  14. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pp.  1305–1338. PMLR, 2020.
  15. A mathematical model for curriculum learning. arXiv preprint arXiv:2301.13833, 2023.
  16. Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra. Springer Science & Business Media, 2013.
  17. The devil is in the detail: Simple tricks improve systematic generalization of transformers. arXiv preprint arXiv:2108.12284, 2021.
  18. Learning parities with neural networks. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  20356–20365. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/eaae5e04a259d09af85c108fe4d7dd0c-Paper.pdf.
  19. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  20. Abstract algebra, volume 3. Wiley Hoboken, 2004.
  21. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891, 2020.
  22. Limitations of lazy training of two-layers neural network. Advances in Neural Information Processing Systems, 32, 2019.
  23. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
  24. Implicit regularization in matrix factorization, 2017. URL https://arxiv.org/abs/1705.09280.
  25. Characterizing implicit bias in terms of optimization geometry, 2018a. URL https://arxiv.org/abs/1802.08246.
  26. Implicit bias of gradient descent on linear convolutional networks, 2018b. URL https://arxiv.org/abs/1806.00468.
  27. Deep models of interactions across sets. In International Conference on Machine Learning, pp. 1909–1918. PMLR, 2018.
  28. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, 2020.
  29. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  30. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933, 2021.
  31. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2901–2910, 2017.
  32. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pp. 10697–10707. PMLR, 2022.
  33. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  34. Curriculum learning and minibatch bucketing in neural machine translation. arXiv preprint arXiv:1707.09533, 2017.
  35. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pp. 2873–2882. PMLR, 2018.
  36. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022.
  37. Towards better out-of-distribution generalization of neural algorithmic reasoning tasks. arXiv preprint arXiv:2211.00692, 2022. URL https://arxiv.org/abs/2211.00692.
  38. Quantifying the benefit of using differentiable learning over tangent kernels. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  7379–7389. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/malach21a.html.
  39. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  40. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
  41. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  42. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pp. 7721–7735. PMLR, 2021.
  43. The construction of multivariate polynomials with preassigned zeros. In European Computer Algebra Conference, pp.  24–31. Springer, 1982.
  44. Implicit bias in deep linear classification: Initialization scale vs training accuracy, 2020. URL https://arxiv.org/abs/2007.06738.
  45. O’Donnell, R. Analysis of Boolean Functions. Cambridge University Press, 2014. doi: 10.1017/CBO9781139814782.
  46. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  47. Competence-based curriculum learning for neural machine translation. arXiv preprint arXiv:1903.09848, 2019.
  48. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  49. On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. PMLR, 2019.
  50. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007.
  51. Equivariance through parameter-sharing. In International conference on machine learning, pp. 2892–2901. PMLR, 2017.
  52. Implicit regularization in deep learning may not be explainable by norms, 2020. URL https://arxiv.org/abs/2005.06398.
  53. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.
  54. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.
  55. The implicit bias of gradient descent on separable data, 2017. URL https://arxiv.org/abs/1710.10345.
  56. From baby steps to leapfrog: How “less is more” in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.  751–759, 2010.
  57. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  58. The CLRS algorithmic reasoning benchmark. arXiv preprint arXiv:2205.15659, 2022.
  59. A fine-grained analysis on distribution shift. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Dl4LetuLdyK.
  60. Frequency principle: Fourier analysis sheds light on deep neural networks, 2019.
  61. A unifying view on implicit bias in training linear neural networks. arXiv preprint arXiv:2010.02501, 2020.
  62. Deep sets. Advances in neural information processing systems, 30, 2017.
  63. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
  64. Pointer value retrieval: A new benchmark for understanding the limits of neural network generalization. arXiv preprint arXiv:2107.12580, 2021.
  65. Unveiling transformers with LEGO: a synthetic reasoning task. arXiv preprint arXiv:2206.04301, 2022.
  66. Meta-learning symmetries by reparameterization. arXiv preprint arXiv:2007.02933, 2020.