- The paper introduces the functional variety concept to measure network expressiveness by quantifying the dimension of the variety of representable polynomial functions.
- It employs algebraic geometry tools and backpropagation over polynomial rings and finite fields to rigorously analyze and compute network properties.
- The study identifies a bottleneck effect with narrow intermediate layers, linking network architecture to optimization challenges and tensor decompositions.
This paper investigates the expressive power of deep neural networks that use polynomial activation functions. Unlike general non-linear activations, a polynomial activation (specifically, raising each coordinate to the $r$-th power) makes every function computed by the network a polynomial, so the parameterization from weights to functions is an algebraic map into a space of polynomials. This algebraic structure allows the authors to leverage tools from algebraic geometry to analyze the set of functions a network can represent.
The core concept introduced is the "functional variety" $\mathcal{V}_{d,r}$. For a fixed network architecture $d=(d_0,\dots,d_h)$ and activation degree $r$, the set of functions obtained by varying the network weights is a subset of a finite-dimensional vector space of polynomials: each function consists of $d_h$ output polynomials, each a homogeneous polynomial of degree $r^{h-1}$ in the $d_0$ input variables. The Zariski closure of this set is the functional variety $\mathcal{V}_{d,r}$. The dimension of this variety is proposed as a precise measure of the network's expressive power, quantifying the degrees of freedom in the functions it can represent. While the actual functional space $\mathcal{F}_{d,r}$ might be strictly smaller than its Zariski closure $\mathcal{V}_{d,r}$, even in the standard Euclidean topology, the two have the same dimension.
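To make the weights-to-polynomials map concrete, here is a minimal SymPy sketch (not the authors' SageMath code; the toy architecture $d=(2,3,1)$, degree $r=2$, and symbol names are chosen for illustration). It builds the parameter-to-function map and confirms that each output is a homogeneous polynomial of degree $r^{h-1}$ in the inputs.

```python
# Sketch (plain SymPy, not the paper's SageMath code): the map from weights to
# output polynomials for a toy architecture d = (2, 3, 1) with activation x^r, r = 2.
import sympy as sp

d, r = (2, 3, 1), 2
h = len(d) - 1                                   # number of weight layers
x = sp.symbols(f"x0:{d[0]}")                     # input variables x0, ..., x_{d0-1}

# One symbolic weight matrix per layer (names are illustrative).
W = [sp.Matrix(d[i + 1], d[i],
               lambda a, b, i=i: sp.Symbol(f"w{i}_{a}_{b}"))
     for i in range(h)]

# Forward pass over the polynomial ring: linear layer, then entrywise r-th power
# after every layer except the last.
v = sp.Matrix(x)
for i in range(h):
    v = W[i] * v
    if i < h - 1:
        v = v.applyfunc(lambda p: sp.expand(p**r))

outputs = [sp.expand(p) for p in v]              # d_h output polynomials
print(outputs[0])                                # homogeneous in x0, x1 ...
print(sp.Poly(outputs[0], *x).total_degree())    # ... of degree r^(h-1) = 2
```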
The paper establishes a connection between the functional variety's properties and the network's optimization landscape. If the functional variety $\mathcal{V}_{d,r}$ "fills" the ambient space of polynomials (meaning its dimension equals the dimension of the polynomial space $(\mathrm{Sym}_{r^{h-1}}\mathbb{R}^{d_0})^{d_h}$), the actual functional space $\mathcal{F}_{d,r}$ is "thick" (has positive Lebesgue measure). The authors prove that if the functional space is not thick, there exist convex loss functions whose optimization landscape has arbitrarily bad local minima (Propositions 3.1 and 3.2). Conversely, architectures with thick/filling functional spaces are expected to have more favorable optimization properties.
The structure of polynomial networks is also related to tensor and polynomial decompositions. Shallow networks ($h=2$) are shown to compute partially symmetric tensors that can be expressed as a sum of $d_1$ rank-1 terms, connecting to CP tensor decomposition. Deep polynomial networks correspond to an iterated form of such decompositions. The paper links the ability of a network to "fill" the ambient polynomial space to properties of these tensor/polynomial decompositions.
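Concretely, in the shallow case $h=2$, writing $a_1,\dots,a_{d_1}$ for the rows of the first weight matrix and $B=(b_{ji})$ for the second (notation chosen here for illustration, not necessarily the paper's), the $j$-th output polynomial is

$$p_j(x) \;=\; \sum_{i=1}^{d_1} b_{ji}\,\big(a_i^{\top} x\big)^{r}, \qquad j = 1,\dots,d_2,$$

i.e., a combination of $d_1$ powers of linear forms, which is exactly the sum of $d_1$ rank-1 terms of the associated partially symmetric tensor.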
Key theoretical results focus on determining the dimension of Vd,r:
- Upper Bound: The dimension is bounded above by the minimum of the number of network parameters after accounting for symmetries, $\sum_{i=1}^{h} d_i d_{i-1} - \sum_{i=1}^{h-1} d_i = d_h + \sum_{i=1}^{h} (d_{i-1}-1)\,d_i$, and the dimension of the ambient polynomial space, $d_h \binom{d_0 + r^{h-1} - 1}{r^{h-1}}$ (Theorem 4.1); a small numeric check of this bound is sketched after this list:

$$\dim \mathcal{V}_{d,r} \;\le\; \min\!\left( d_h + \sum_{i=1}^{h} (d_{i-1}-1)\,d_i,\;\; d_h \binom{d_0 + r^{h-1} - 1}{r^{h-1}} \right)$$
- High Degree Limit: For a fixed architecture, the dimension attains the parameter-count term of this upper bound once the activation degree $r$ is sufficiently high (Theorem 4.1).
- Recursive Bound: The dimension of a deep network can be bounded using the dimensions of the two shallower networks obtained by splitting it at an intermediate layer $k$ (Proposition 4.3):

$$\dim \mathcal{V}_{(d_0,\dots,d_h),r} \;\le\; \dim \mathcal{V}_{(d_0,\dots,d_k),r} + \dim \mathcal{V}_{(d_k,\dots,d_h),r} - d_k$$
- Bottleneck Property: The paper proves that an intermediate layer of width $d_i = 2d_0 - 2$ acts as an "asymptotic bottleneck": regardless of how wide the other layers are or how deep the network becomes, an intermediate layer of this width (or smaller) prevents the functional variety from filling the ambient space (Theorem 4.4).
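As referenced in the Upper Bound item above, the following is a small numeric check of the Theorem 4.1 bound, assuming the formula as reconstructed in that item (a sketch, not code from the paper; the function name and example architecture are illustrative).

```python
# Sketch: evaluate both terms of the dimension upper bound for a given
# architecture d = (d0, ..., dh) and activation degree r.
from math import comb

def dim_upper_bound(d, r):
    h = len(d) - 1
    # Parameter count after quotienting by the rescaling symmetries.
    params = d[h] + sum((d[i - 1] - 1) * d[i] for i in range(1, h + 1))
    # Dimension of the ambient space of d_h homogeneous polynomials of degree r^(h-1).
    ambient = d[h] * comb(d[0] + r**(h - 1) - 1, r**(h - 1))
    return min(params, ambient), params, ambient

print(dim_upper_bound((2, 3, 2), 2))   # -> (6, 9, 6): bound, parameter term, ambient dimension
```

Since the dimension never exceeds the parameter term, an architecture can fill the ambient space only when the parameter term is at least the ambient dimension, so scanning this helper over candidate widths gives a quick necessary condition for filling.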
To compute these dimensions in practice, the authors developed methods based on the fact that the dimension of the functional variety equals the generic rank of the Jacobian matrix of the parameter-to-function map $\Phi_{d,r}$ (Lemma 3.3). They implemented this in SageMath, leveraging automatic differentiation (backpropagation); a minimal stand-alone version of the rank computation is sketched after the list below.
- Backpropagation over a Polynomial Ring: Define the network operations using polynomials in the input variables $x_1,\dots,x_{d_0}$. Backpropagating the output polynomials with respect to the network weights yields polynomials whose coefficients form the Jacobian entries.
- Backpropagation over a Finite Field: Perform backpropagation over a finite field $\mathbb{Z}/p\mathbb{Z}$ at several random input points $x$; the Jacobian entries can then be recovered by solving a linear system. This approach is generally faster.
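The sketch referenced above mirrors the polynomial-ring variant in plain SymPy rather than the paper's SageMath implementation: it builds the output polynomials symbolically, differentiates their coefficients with respect to the weights, and takes the rank of the resulting Jacobian at a random integer point, which generically equals $\dim \mathcal{V}_{d,r}$ by Lemma 3.3. The architecture, symbol names, and random-point strategy are illustrative.

```python
# Sketch: estimate dim V_{d,r} as the rank of the Jacobian of the
# weights-to-coefficients map at a random parameter point (cf. Lemma 3.3).
# Plain SymPy, not the paper's SageMath code; d, r, and names are illustrative.
import random
import sympy as sp

d, r = (2, 2, 2), 2
h = len(d) - 1
x = sp.symbols(f"x0:{d[0]}")
W = [sp.Matrix(d[i + 1], d[i],
               lambda a, b, i=i: sp.Symbol(f"w{i}_{a}_{b}"))
     for i in range(h)]
weights = [w for Wi in W for w in Wi]

# Forward pass over the polynomial ring: linear layer, then entrywise r-th power
# after every layer except the last.
v = sp.Matrix(x)
for i in range(h):
    v = W[i] * v
    if i < h - 1:
        v = v.applyfunc(lambda p: sp.expand(p**r))

# The coefficients (in x) of all output polynomials are the coordinates of Phi_{d,r}.
coeff_vector = []
for p in v:
    coeff_vector.extend(sp.Poly(sp.expand(p), *x).coeffs())

# Jacobian with respect to the weights, evaluated at random integer weights.
J = sp.Matrix(coeff_vector).jacobian(sp.Matrix(weights))
J_num = J.subs({w: random.randint(-50, 50) for w in weights})
print("Jacobian rank at a random point (generically = dim V_{d,r}):", J_num.rank())
```

Evaluating at a few independent random points and taking the maximum rank guards against landing on a non-generic parameter choice; the paper's finite-field variant trades the symbolic expansion for faster arithmetic.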
These computational experiments (examples shown in Tables 1 and 2) provide empirical insights:
- The dimension of the functional variety stabilizes as the activation degree r increases, consistent with the theoretical result.
- Minimal architectures that fill the ambient space tend to have layer widths that are unimodal across the network (widths first increase and then decrease from input to output). This computational finding motivates the conjecture that minimal filling architectures are unimodal.
The paper proposes polynomial networks as a testbed for understanding general deep nonlinear networks, arguing that insights gained here could transfer due to polynomial approximation properties (Stone-Weierstrass theorem). Future work could involve studying the geometry of functional varieties beyond dimension (e.g., degree), exploring other algebraic architectures (convolutions, different polynomial activations), and developing algorithmic methods for learning based on tensor decomposition principles.
In summary, the paper provides a rigorous algebraic framework for studying the expressiveness of polynomial neural networks, defines expressiveness via the dimension of the functional variety, connects this dimension to optimization properties and tensor decompositions, provides theoretical results on dimension and bottleneck widths, and demonstrates practical computational methods for dimension calculation using standard ML techniques like backpropagation adapted to algebraic settings. The findings offer insights into how network architecture, particularly layer widths and activation degrees, impacts the space of functions a network can represent.