
SGD Learns the Conjugate Kernel Class of the Network (1702.08503v2)

Published 27 Feb 2017 in cs.LG, cs.DS, and stat.ML

Abstract: We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network, as defined in Daniely, Frostig and Singer. The result holds for log-depth networks from a rich family of architectures. To the best of our knowledge, it is the first polynomial-time guarantee for the standard neural network learning algorithm for networks of depth more than two. As corollaries, it follows that for neural networks of any depth between $2$ and $\log(n)$, SGD is guaranteed to learn, in polynomial time, constant degree polynomials with polynomially bounded coefficients. Likewise, it follows that SGD on large enough networks can learn any continuous function (not in polynomial time), complementing classical expressivity results.

Citations (176)

Summary

  • The paper demonstrates that Stochastic Gradient Descent (SGD) can learn functions in the conjugate kernel space of deep neural networks within polynomial time.
  • Key implications include that, for networks of depth up to $\log(n)$, SGD can learn constant-degree polynomials with polynomially bounded coefficients, which covers simple logical function classes such as conjunctions and DNF/CNF formulas under suitable restrictions.
  • The findings provide theoretical guarantees for SGD in deep learning, suggesting potential for scaling to larger datasets and complex input spaces.

An Analysis of SGD Learning in Conjugate Kernel Spaces

The paper "SGD Learns the Conjugate Kernel Class of the Network" by Amit Daniely presents significant insights into the capabilities of Stochastic Gradient Descent (SGD) for learning functions within the framework of neural networks. The primary focus is on demonstrating that SGD can efficiently learn functions in the conjugate kernel space associated with the network's architecture. This achievement is particularly notable for neural networks with a depth greater than two, where prior polynomial-time guarantees were lacking.

Key Contributions

The research builds upon the foundational work of associating a reproducing kernel with network architectures, as outlined in previous literature. The main result is that for a variety of architectures, including fully connected and convolutional networks of log-depth, SGD can learn any function in the corresponding kernel space in polynomial time. This guarantee holds for appropriate choices of the network size, step size, and number of SGD iterations, all of which scale polynomially with the relevant problem parameters, such as the norm of the target function in the kernel space, the input dimension, and the desired accuracy; a minimal training-loop sketch of this regime is given below.
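
As a rough illustration of the regime this result speaks to, the following minimal PyTorch sketch trains a wide, randomly initialized, fully connected network with plain SGD, one fresh example per step. It is not the paper's construction: the architecture, width, step size, loss, and synthetic data stream are all illustrative choices, but they name the quantities (network size, step size, number of iterations) that the theorem asks to scale polynomially.

```python
import torch
import torch.nn as nn

n = 100          # input dimension
width = 4000     # hidden width ("network size"); the theorem wants this polynomial in the problem parameters
depth = 3        # number of hidden layers; the result covers log-depth architectures
steps = 20000    # number of SGD iterations
lr = 0.01        # SGD step size

# Wide fully connected ReLU network with standard random initialization.
layers, d_in = [], n
for _ in range(depth):
    layers += [nn.Linear(d_in, width), nn.ReLU()]
    d_in = width
layers.append(nn.Linear(d_in, 1))
net = nn.Sequential(*layers)

opt = torch.optim.SGD(net.parameters(), lr=lr)
loss_fn = nn.SoftMarginLoss()  # convex surrogate loss for +/-1 classification

def sample_example():
    # Hypothetical data stream: x on the unit sphere, labels from a fixed
    # linear functional (a stand-in for any target of bounded norm in the
    # conjugate kernel space).
    x = torch.randn(1, n)
    x = x / x.norm()
    y = torch.where(x[:, 0:1] > 0, torch.tensor(1.0), torch.tensor(-1.0))
    return x, y

# One fresh example per SGD step, as in the online/stochastic setting analyzed.
for _ in range(steps):
    x, y = sample_example()
    opt.zero_grad()
    loss_fn(net(x), y).backward()
    opt.step()
```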

An additional corollary established in the paper is that, for networks of depth between $2$ and $\log(n)$, SGD can learn constant degree polynomials with polynomially bounded coefficients. Furthermore, for sufficiently large networks, SGD is theoretically capable of learning any continuous function; this guarantee is not polynomial-time, but it complements well-documented expressivity results by showing that such functions are not only representable but also learnable.
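
Schematically, and in our own notation rather than the paper's, the guarantees have the following flavor: for a norm bound $M$ and target accuracy $\epsilon$, with network size, step size, and number of SGD iterations polynomial in $M$, $1/\epsilon$, and the input dimension, the hypothesis $h_{\mathrm{SGD}}$ returned by SGD satisfies

$$\mathbb{E}\big[\mathcal{L}_{\mathcal{D}}(h_{\mathrm{SGD}})\big] \;\le\; \min_{f \in \mathcal{H}_{\kappa},\; \|f\|_{\kappa} \le M} \mathcal{L}_{\mathcal{D}}(f) \;+\; \epsilon,$$

where $\mathcal{H}_{\kappa}$ is the conjugate kernel space of the architecture and $\mathcal{L}_{\mathcal{D}}$ denotes the expected loss over the data distribution $\mathcal{D}$.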

Implications and Analysis

With these findings, the paper provides much-needed theoretical guarantees on the use of SGD in supervised learning tasks involving complex network architectures. Two prominent implications arise from this work:

  1. Constant Degree Polynomial Learning: For any network depth between $2$ and $\log(n)$, where $n$ is the input dimension, SGD can learn constant degree polynomials with polynomially bounded coefficients. This allows it to efficiently handle classes of simple logical functions, such as conjunctions and DNF/CNF formulas under suitable restrictions, and aligns SGD's provable learning capabilities with those of other known polynomial-time methods for these classes (a worked example follows this list).
  2. Universal Approximation Capabilities: In line with classical expressivity results, the paper shows that large enough networks trained with SGD can approximate any continuous function, so such functions are not only representable by neural networks but also learnable, albeit not in polynomial time.
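
To make the first point concrete, here is a standard illustration (ours, not taken from the paper) of why simple logical functions sit inside the class of constant-degree polynomials with bounded coefficients: over $\{\pm 1\}^n$, a conjunction of $k$ variables can be written as

$$x_{i_1} \wedge \dots \wedge x_{i_k} \;=\; \prod_{j=1}^{k} \frac{1 + x_{i_j}}{2} \;=\; \frac{1}{2^{k}} \sum_{S \subseteq \{i_1,\dots,i_k\}} \prod_{i \in S} x_i,$$

which takes the value $1$ when all variables are true and $0$ otherwise (negated literals are handled by replacing $1 + x_{i_j}$ with $1 - x_{i_j}$). This is a polynomial of degree $k$ with coefficients bounded by $1$, so when $k$ is a constant it is exactly the kind of target the corollary covers.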

Theoretical and Practical Speculations

The research suggests possibilities for further refining both the empirical performance and the theoretical understanding of SGD in deep learning. The polynomial bounds, while demonstrably valid, have degrees that might be improved to yield more efficient learning guarantees. An open question also remains as to whether SGD can provably learn richer classes than the conjugate kernel class, which is essentially a linear class over a fixed feature map, pushing the exploration toward genuinely non-linear model classes.

The potential for transferring these theoretical insights to practical applications is considerable. The scalability of neural networks and the generalization behavior of SGD-driven learning suggest that such guarantees could eventually inform training on larger datasets and more diverse input distributions.

Conclusion

In summary, the paper offers robust polynomial-time guarantees for SGD in learning specific classes within neural networks, addressing a crucial gap in theoretical understanding for architectures beyond shallow networks. As the field progresses, these insights could pave the way for more innovative applications and enhanced methodological frameworks in AI. The ongoing challenge remains in narrowing the gap between theoretical predictions and practical implementations, a frontier ripe for exploration in future artificial intelligence endeavors.