Matching the Statistical Query Lower Bound for $k$-Sparse Parity Problems with Sign Stochastic Gradient Descent (2404.12376v2)

Published 18 Apr 2024 in cs.LG, math.OC, and stat.ML

Abstract: The $k$-sparse parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-sparse parity problem with sign stochastic gradient descent, a variant of stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that this approach can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hypercube ($k\leq O(\sqrt{d})$) with a sample complexity of $\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, matching the established $\Omega(d^{k})$ lower bound for Statistical Query (SQ) models. Our theoretical analysis begins by constructing a good neural network capable of correctly solving the $k$-parity problem. We then demonstrate how a neural network trained with sign SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical errors. To the best of our knowledge, this is the first result that matches the SQ lower bound for solving the $k$-sparse parity problem using gradient-based methods.
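
The abstract describes the training procedure only at a high level. As a concrete illustration, the sketch below runs sign SGD (updating weights with the sign of the stochastic gradient) on a two-layer ReLU network fed synthetic $k$-sparse parity data on the $d$-dimensional hypercube. The width, hinge loss, fixed second layer, learning rate, and all other hyperparameters are illustrative assumptions, not the paper's actual construction or analysis.

```python
# Minimal sketch of sign SGD on a two-layer network for k-sparse parity.
# The ReLU activation, hinge loss, fixed second layer, width, and step size
# are assumptions for illustration; they are not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

d, k = 20, 3                                        # ambient dimension, parity order
S = rng.choice(d, size=k, replace=False)            # hidden support of the parity

def sample_batch(n):
    X = rng.choice([-1.0, 1.0], size=(n, d))        # uniform hypercube inputs
    y = np.prod(X[:, S], axis=1)                    # k-sparse parity labels in {-1, +1}
    return X, y

m = 2 ** (k + 2)                                    # width on the order of 2^{Theta(k)} (illustrative)
W = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(d)   # trainable first-layer weights
a = rng.choice([-1.0, 1.0], size=m) / m                  # fixed second-layer weights

def forward(X):
    return np.maximum(X @ W.T, 0.0) @ a             # two-layer ReLU network output

lr, batch, steps = 0.02, 512, 2000
for t in range(steps):
    X, y = sample_batch(batch)
    margin = y * forward(X)
    active = (margin < 1.0).astype(float)           # hinge-loss subgradient mask
    relu_mask = (X @ W.T > 0.0).astype(float)       # (batch, m) ReLU activation pattern
    # Gradient of the mean hinge loss with respect to the first-layer weights.
    coeff = (-(y * active))[:, None] * relu_mask * a[None, :]
    grad_W = coeff.T @ X / batch                    # (m, d)
    W -= lr * np.sign(grad_W)                       # sign SGD update

X_test, y_test = sample_batch(4096)
print("test accuracy:", np.mean(np.sign(forward(X_test)) == y_test))
```

This is only a toy-scale demonstration of the update rule (weights move by the sign of the gradient rather than its magnitude); the paper's sample-complexity and width guarantees rest on its specific initialization and analysis, which this sketch does not reproduce.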

