- The paper proves that one-hidden-layer ReLU networks require exponentially many perturbed gradient descent (PGD) iterations to achieve even a non-trivial reduction in expected loss when learning fixed parity functions.
- The analysis leverages the exponential decay of the Fourier coefficients of linear threshold functions to explain why gradients carry so little information during training.
- It shows that even a single ReLU neuron requires resources exponential in the parity's support size to achieve a non-trivial learning outcome on fixed parity tasks.
Hardness of Learning Fixed Parities with Neural Networks
This paper addresses a notable challenge in learning theory: the difficulty of learning parity functions with neural networks trained by common gradient-based methods such as gradient descent. Although parity functions are simple to describe (a parity is a linear function over the field GF(2)), they are widely recognized as hard for gradient-based learning. This work investigates the fundamental reasons why fixed parities, meaning parities whose support set is predetermined rather than chosen adversarially, pose a significant hurdle for gradient methods, even in relatively low-dimensional settings.
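Concretely, a parity over a fixed support set S outputs the XOR of the selected input bits, and it is linear over GF(2). A minimal sketch (the support set and inputs below are arbitrary examples, not from the paper):

```python
from functools import reduce
from operator import xor

def parity(x, S):
    # parity over a fixed support set S: the XOR of the selected bits
    return reduce(xor, (x[i] for i in S), 0)

S = (0, 2, 3)            # a fixed, predetermined support set (illustrative)
a = [1, 0, 1, 1, 0]
b = [0, 1, 1, 0, 1]
xor_ab = [ai ^ bi for ai, bi in zip(a, b)]

# linearity over GF(2): parity(a XOR b) == parity(a) XOR parity(b)
assert parity(xor_ab, S) == parity(a, S) ^ parity(b, S)
print(parity(a, S))      # 1 ^ 1 ^ 1 = 1
```

This linearity is what makes parities trivial for, say, Gaussian elimination over GF(2), which is exactly why their hardness for gradient-based learners is striking.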
Key Contributions
- Demonstration of Hardness with One-Hidden-Layer Networks: The authors show that learning a fixed parity of sufficiently large support with a one-hidden-layer ReLU network trained by perturbed gradient descent (PGD) is infeasible: the paper proves that exponentially many PGD iterations are required to achieve even a non-trivial reduction in expected loss.
- Fourier Coefficient Analysis: A key ingredient in the hardness results is the decay behavior of the Fourier coefficients of linear threshold functions. The paper proves that these coefficients decay exponentially in the size of the corresponding support set. This decay explains why gradients carry so little information about the target parity during PGD training, which in turn drives the learning difficulty.
- Learning Restrictions with a Single Neuron: The analysis extends to a single ReLU neuron under the quadratic loss. The authors show that even this minimal architecture, when trained with PGD, requires resources exponential in the parity's support size to achieve a non-trivial learning outcome.
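The Fourier coefficients in question can be computed exactly for a small linear threshold function by brute-force enumeration over the Boolean cube. A minimal sketch, where the particular threshold function, dimension, and bias are illustrative choices and not the paper's setup:

```python
import itertools

n = 6  # small enough to enumerate all 2^n inputs exactly

def ltf(x):
    # a linear threshold function: sign(x_1 + ... + x_n - 0.5), inputs in {-1, +1}
    return 1.0 if sum(x) > 0.5 else -1.0

def fourier_coeff(S):
    # hat f(S) = E_x[f(x) * chi_S(x)] over the uniform distribution on {-1, +1}^n,
    # where chi_S(x) = prod_{i in S} x_i is the parity character on S
    total = 0.0
    for x in itertools.product((-1.0, 1.0), repeat=n):
        chi = 1.0
        for i in S:
            chi *= x[i]
        total += ltf(x) * chi
    return total / 2 ** n

# total squared Fourier mass at each level |S| = k
by_level = {k: 0.0 for k in range(n + 1)}
for k in range(n + 1):
    for S in itertools.combinations(range(n), k):
        by_level[k] += fourier_coeff(S) ** 2

for k in range(n + 1):
    print(k, round(by_level[k], 4))
```

By Parseval's identity, the squared coefficients of a ±1-valued function sum to 1, so the printout shows how that unit mass distributes across support sizes; the paper's result concerns how fast individual coefficients shrink as the support grows.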
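As a rough empirical illustration of the single-neuron setting, the sketch below trains one ReLU neuron under the quadratic loss with a perturbed gradient step (a plain gradient step plus Gaussian noise) against a full-support parity. The dimension, step size, noise scale, and iteration count are arbitrary choices for illustration, not the paper's parameters:

```python
import random

random.seed(0)
n = 10                       # input dimension (illustrative)
S = list(range(n))           # a fixed parity over all n coordinates

def sample():
    x = [random.choice((-1.0, 1.0)) for _ in range(n)]
    y = 1.0
    for i in S:
        y *= x[i]            # +/-1 parity target chi_S(x)
    return x, y

w = [random.gauss(0, 0.1) for _ in range(n)]
b = 0.0
lr, noise = 0.01, 0.01       # step size and perturbation scale (arbitrary)

def loss_batch(m=2000):
    # Monte Carlo estimate of the expected quadratic loss
    total = 0.0
    for _ in range(m):
        x, y = sample()
        pred = max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += (pred - y) ** 2
    return total / m

before = loss_batch()
for _ in range(500):         # a few hundred perturbed gradient steps
    x, y = sample()
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    pred = max(0.0, z)
    g = 2 * (pred - y) * (1.0 if z > 0 else 0.0)   # d(loss)/dz through the ReLU
    w = [wi - lr * (g * xi + random.gauss(0, noise)) for wi, xi in zip(w, x)]
    b -= lr * (g + random.gauss(0, noise))
after = loss_batch()
print(round(before, 3), round(after, 3))
```

With a full-support parity target, the expected loss hovers near the trivial baseline of 1 (the loss of predicting zero), consistent with the gradients carrying very little information about the parity.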
Implications and Theoretical Insights
The paper offers both theoretical and empirical perspectives that sharpen our understanding of the limitations of neural architectures in learning fixed structures. The use of Fourier analysis connects classical harmonic analysis with practical machine learning questions about function learnability, and the exponential decay of the relevant Fourier coefficients serves as a theoretical cornerstone that could inform future studies of the reach and limits of gradient dynamics on structured tasks.
Future Directions
While this research firmly establishes the difficulty of learning certain fixed patterns with standard network architectures and optimization methods, it opens avenues for exploring novel architectures and initialization strategies. One prospective direction is the design of architectures, or of optimization frameworks beyond standard PGD or SGD, that sidestep these inherent limits. Another interesting direction is to identify the theoretical boundary at which learning parity functions becomes feasible without an exponential blowup in computational resources.
Overall, this paper provides a foundational understanding of why certain simply stated computational tasks remain out of reach for simple neural approximators, grounding an ongoing conversation about how to overcome the learning bottlenecks imposed by problem structure.