Matching the Statistical Query Lower Bound for $k$-Sparse Parity Problems with Sign Stochastic Gradient Descent (2404.12376v2)
Abstract: The $k$-sparse parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-sparse parity problem with sign stochastic gradient descent, a variant of stochastic gradient descent (SGD), on two-layer fully-connected neural networks. We demonstrate that this approach can efficiently solve the $k$-sparse parity problem on the $d$-dimensional hypercube ($k \leq O(\sqrt{d})$) with a sample complexity of $\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, matching the established $\Omega(d^{k})$ lower bound for Statistical Query (SQ) models. Our theoretical analysis begins by constructing a good neural network that correctly solves the $k$-sparse parity problem. We then show that a neural network trained with sign SGD can effectively approximate this good network, solving the $k$-sparse parity problem with small statistical error. To the best of our knowledge, this is the first result that matches the SQ lower bound for solving the $k$-sparse parity problem using gradient-based methods.
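To make the setup concrete, the following is a minimal sketch, not the paper's construction: it generates $k$-sparse parity data on the $d$-dimensional hypercube and trains the first layer of a two-layer fully-connected ReLU network with sign SGD, i.e., stepping along the coordinate-wise sign of a stochastic gradient. The width, learning rate, batch size, fixed second layer, and hinge surrogate loss are illustrative assumptions.

```python
# A minimal sketch (assumptions noted above): k-sparse parity data and sign SGD
# on a two-layer fully-connected ReLU network.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 30, 3, 5000                    # ambient dimension, sparsity, sample size
m = 2 ** (k + 2)                         # hidden width on the order of 2^{Theta(k)}
support = rng.choice(d, size=k, replace=False)

# Data: uniform +/-1 inputs; the label is the parity of the k support coordinates.
X = rng.choice([-1.0, 1.0], size=(n, d))
y = np.prod(X[:, support], axis=1)

# Two-layer network f(x) = a^T ReLU(W x); the second layer is fixed, only W is trained.
W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / m
lr, batch = 0.05, 64

def forward(Xb):
    h = np.maximum(Xb @ W.T, 0.0)        # hidden activations, shape (B, m)
    return h, h @ a                      # network outputs, shape (B,)

for start in range(0, n, batch):
    Xb, yb = X[start:start + batch], y[start:start + batch]
    h, out = forward(Xb)
    # Gradient of the hinge surrogate loss max(0, 1 - y f(x)) with respect to W.
    dloss_dout = -yb * (yb * out < 1.0)                  # shape (B,)
    grad_W = ((dloss_dout[:, None] * (h > 0.0)) * a).T @ Xb / len(yb)
    # Sign SGD: update along the coordinate-wise sign of the stochastic gradient.
    W -= lr * np.sign(grad_W)

_, out_all = forward(X)
print("training accuracy:", np.mean(np.sign(out_all) == y))
```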