Deep vs. shallow networks : An approximation theory perspective (1608.03287v1)

Published 10 Aug 2016 in cs.LG and math.FA

Abstract: The paper briefly reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden-layer architectures. The paper announces new results for a non-smooth activation function - the ReLU function - used in present-day neural networks, as well as for the Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

Citations (332)

Summary

  • The paper demonstrates that deep networks approximate compositional functions with far fewer parameters than shallow architectures.
  • The study employs approximation theory and the concept of relative dimension to explain efficiency gains in deep convolutional networks.
  • The findings have practical implications for designing resource-efficient models that excel in high-dimensional function approximation tasks.

Deep vs. Shallow Networks: An Approximation Theory Perspective

The paper, "Deep vs. Shallow Networks: an Approximation Theory Perspective," provides an analytical framework for understanding the comparative advantages of deep neural networks, particularly deep convolutional neural networks (DCNNs), over shallow neural architectures in terms of function approximation efficacy. Primarily, it seeks to elucidate the conditions under which deep architectures significantly outperform shallow ones, leveraging concepts from approximation theory.

The discussion is grounded in the notion of function compositionality, where a function can be decomposed into a hierarchy of simpler constituent functions. The paper posits that deep networks excel particularly at approximating compositional functions, that is, functions that naturally align with a multi-layered or hierarchical structure. This is contrasted with shallow networks, which, while capable of universal function approximation, suffer from the curse of dimensionality and consequently require exponentially more parameters to achieve the same approximation accuracy for a large class of functions.
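
To make the notion of compositionality concrete, the following is a minimal, purely illustrative sketch (not code from the paper) of a function of eight variables built as a binary tree of bivariate constituent functions. A deep network can mirror this tree and approximate each two-argument node separately, whereas a shallow network must treat f as a generic function of all eight inputs at once.

```python
# Illustrative compositional function on eight inputs, structured as a binary
# tree of bivariate constituents; the particular choice of h is arbitrary.

def h(u, v):
    """A generic smooth bivariate constituent function (illustrative choice)."""
    return u * v + 0.5 * (u - v) ** 2

def f_compositional(x1, x2, x3, x4, x5, x6, x7, x8):
    # Level 1: pairwise combinations of the raw inputs.
    a, b = h(x1, x2), h(x3, x4)
    c, d = h(x5, x6), h(x7, x8)
    # Level 2: combine the level-1 outputs.
    e, g = h(a, b), h(c, d)
    # Level 3 (root): the final scalar output.
    return h(e, g)

print(f_compositional(1, 2, 3, 4, 5, 6, 7, 8))
```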

The theoretical results include new insights regarding the performance of deep networks using ReLU and Gaussian activation functions. The paper introduces the concept of relative dimension to capture the intrinsic advantage deep networks have over shallow networks. This measure reflects how deep architectures can efficiently exploit function sparsity or compositionality—capturing more complex classes of functions without a proportionate increase in complexity or parameter count.
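
For concreteness, the two activation regimes considered are the non-smooth ReLU and Gaussian (radial-basis-style) units. The sketch below shows a single hidden unit of each kind; the exact Gaussian parameterization is an illustrative assumption, not a quotation of the paper's definition.

```python
import numpy as np

def relu_unit(x, w, b):
    """One ReLU hidden unit: the non-smooth activation max(0, <w, x> + b)."""
    return np.maximum(0.0, np.dot(w, x) + b)

def gaussian_unit(x, center, scale=1.0):
    # One Gaussian (RBF-style) hidden unit; this parameterization is an
    # illustrative assumption rather than the paper's exact formulation.
    return np.exp(-scale * np.sum((x - center) ** 2))

x = np.array([0.5, -1.0, 2.0])
print(relu_unit(x, w=np.array([1.0, 0.3, -0.2]), b=0.1))
print(gaussian_unit(x, center=np.zeros(3)))
```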

Significant quantitative distinctions are presented. For instance, approximating a target function with a shallow network demands $\mathcal{O}(\epsilon^{-q/r})$ parameters to achieve an accuracy of $\epsilon$, where $q$ is the input dimension and $r$ denotes the smoothness order of the function. A deep network that respects the compositional structure reduces this requirement to $\mathcal{O}(\epsilon^{-2/r})$, offering substantial efficiency gains, especially in high-dimensional input spaces.
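
As a rough illustration of how these two bounds scale, the snippet below evaluates them for an arbitrary accuracy, input dimension, and smoothness order; constants are ignored and the numbers are illustrative, not taken from the paper.

```python
# Compare the scaling of the two complexity bounds quoted above
# (constants dropped; values chosen purely for illustration).

def shallow_params(eps, q, r):
    """~ eps^(-q/r): generic shallow (one-hidden-layer) approximation."""
    return eps ** (-q / r)

def deep_params(eps, r):
    """~ eps^(-2/r): deep network matching a binary compositional structure."""
    return eps ** (-2 / r)

eps, q, r = 0.1, 8, 2  # accuracy 0.1, 8-dimensional input, smoothness order 2
print(f"shallow ~ {shallow_params(eps, q, r):,.0f} units")  # ~ 10,000
print(f"deep    ~ {deep_params(eps, r):,.0f} units")        # ~ 10
```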

The implications are far-reaching, extending beyond theoretical interest to practical applications in machine learning and AI, where exploiting such compositional structures may lead to models that are both more powerful and more resource-efficient. The results prompt further exploration into optimal network design and the identification of architectures that best align with the inherent structure of specific function classes.

In conclusion, the paper provides a compelling argument for the practical and theoretical preference for deep architectures in learning tasks entailing compositional representations. It opens avenues for future research focused on formalizing the principles of hierarchical learning more universally across AI applications, suggesting a deeper theoretical framework bridging neural network architecture with the mathematical underpinnings of function approximation.