Quadratic Neural Networks: One Hidden Layer

Updated 8 October 2025
  • One-hidden-layer quadratic neural networks are defined by a hidden layer that applies quadratic functions to inputs, enhancing nonlinearity over conventional linear mappings.
  • They achieve superior expressivity by efficiently approximating complex functions, such as XOR patterns and radial functions, with fewer units than traditional networks.
  • Their optimization landscape, when refined via surrogate objectives and analyzed through gradient descent dynamics, admits provable parameter recovery and generalization guarantees.

A one-hidden-layer quadratic neural network is a neural architecture in which the transformation between the input and the hidden layer is governed by quadratic functions rather than purely affine mappings. This enhanced nonlinearity equips such networks with significantly greater expressive power than conventional networks whose hidden-layer pre-activations are affine. The literature on these networks spans theoretical work on optimization landscapes, generalization, learning dynamics, expressivity, and practical algorithm design. This article provides a comprehensive technical summary of one-hidden-layer quadratic neural networks based on the current state of the research literature.

1. Mathematical Formulation and Expressive Capacity

The canonical one-hidden-layer quadratic neural network takes an input $x \in \mathbb{R}^d$ and computes a hidden layer with $m$ units, where each pre-activation is a quadratic function:

$$h_i(x) = x^{\top} Q_i x + w_i^{\top} x + b_i, \quad 1 \leq i \leq m,$$

with $Q_i \in \mathbb{R}^{d \times d}$ (symmetric), $w_i \in \mathbb{R}^d$, and $b_i \in \mathbb{R}$. The output layer combines these (optionally after an activation $\sigma$):

$$f(x) = a^{\top} \sigma(h(x)) + c,$$

where $a \in \mathbb{R}^m$ and $c \in \mathbb{R}$.
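
To make the formulation concrete, the following is a minimal NumPy sketch of the forward pass defined above. The shapes and the symmetrization of each $Q_i$ follow the definition; the choice of activation ($\tanh$) and the random parameters are purely illustrative.

```python
import numpy as np

def quadratic_layer_forward(x, Q, W, b, a, c, sigma=np.tanh):
    """One-hidden-layer quadratic network:
    h_i(x) = x^T Q_i x + w_i^T x + b_i,   f(x) = a^T sigma(h(x)) + c.

    Shapes: x (d,), Q (m, d, d), W (m, d), b (m,), a (m,), c scalar.
    """
    Q_sym = 0.5 * (Q + np.transpose(Q, (0, 2, 1)))          # enforce symmetry of each Q_i
    h = np.einsum("j,ijk,k->i", x, Q_sym, x) + W @ x + b    # quadratic + linear + bias terms
    return a @ sigma(h) + c

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
d, m = 4, 3
x = rng.standard_normal(d)
Q = rng.standard_normal((m, d, d))
W = rng.standard_normal((m, d))
b = rng.standard_normal(m)
a = rng.standard_normal(m)
print(quadratic_layer_forward(x, Q, W, b, a, 0.0))
```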

This quadratic formulation allows a single neuron to model decision boundaries that are general hyperquadrics, such as ellipsoids, paraboloids, and hyperboloids, rather than just hyperplanes as in standard architectures. As a consequence, one-hidden-layer quadratic networks can efficiently approximate certain function classes and solve representation tasks (e.g., XOR, concentric-rings separation) that would require deep or wide conventional networks (Fan et al., 2018).
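
As a small illustration of the XOR point, a single quadratic pre-activation can carve out the XOR pattern directly. The particular choice $h(x) = x_1 x_2$ (i.e., $Q = \tfrac{1}{2}\begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}$, $w = 0$, $b = 0$) is a hand-picked example for this article, not a construction from the cited papers.

```python
import numpy as np

# Single quadratic neuron h(x) = x^T Q x with Q = [[0, .5], [.5, 0]], i.e. h(x) = x1 * x2.
Q = np.array([[0.0, 0.5],
              [0.5, 0.0]])

# XOR with +/-1 encoding: the label is 1 when the two inputs differ.
inputs = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

h = np.einsum("ni,ij,nj->n", inputs, Q, inputs)    # h(x) = x1 * x2 for each row
pred = (h < 0).astype(int)                         # decision boundary x1 * x2 = 0 (a degenerate hyperbola)
print(pred, (pred == labels).all())                # [0 1 1 0] True
```

Note that $Q$ here is indefinite (eigenvalues $\pm 0.5$), which is what produces the hyperbolic, XOR-style boundary; this connects to the eigenvalue-based taxonomy discussed in Section 4.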

— Theorem (Expressive Efficiency): There exists a radial function $\tilde{g}$ that can be approximated to within a fixed $L_2$ error by a quadratic network with polynomially many (e.g., $\mathcal{O}(d^{3.75})$) hidden units, whereas standard one-hidden-layer networks require exponentially many nodes ($\geq c e^{cd}$) (Fan et al., 2018).
— A ReLU-activated quadratic network can universally approximate any continuous radial function with just four neurons per layer (Fan et al., 2018).

2. Optimization Landscape and Loss Design

The learning of one-hidden-layer quadratic networks via empirical risk minimization displays distinct, mathematically tractable properties. Using the standard squared loss on Gaussian input distributions, the population risk decomposes as

$$f(a, B) = \sum_{k=0}^{\infty} \widehat{\sigma}_k^2\, \Big\| \sum_{i=1}^m (b_i^*)^{\otimes k} - \sum_{i=1}^m a_i\, b_i^{\otimes k} \Big\|_F^2 + \mathrm{const},$$

where the $\widehat{\sigma}_k$ are the Hermite coefficients of the activation (Ge et al., 2017).

This decomposition shows that empirical risk minimization can be interpreted as a series of simultaneous low-rank tensor decomposition problems. However, the natural landscape of the standard squared loss contains spurious local minima unless expressly modified.
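
To connect this decomposition to something computable, the snippet below estimates the Hermite coefficients $\widehat{\sigma}_k$ of an activation by Monte Carlo against the probabilists' Hermite polynomials. The normalization convention ($\widehat{\sigma}_k = \mathbb{E}[\sigma(g)\,\mathrm{He}_k(g)]/\sqrt{k!}$) is an assumption made here for illustration (papers differ); with this choice, the quadratic activation $\sigma(z)=z^2$ has only $\widehat{\sigma}_0 = 1$ and $\widehat{\sigma}_2 = \sqrt{2}$ nonzero.

```python
import numpy as np
from math import factorial, sqrt
from scipy.special import hermitenorm   # probabilists' Hermite polynomials He_k

def hermite_coeffs(sigma, k_max=4, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of sigma_hat_k = E[sigma(g) He_k(g)] / sqrt(k!), g ~ N(0, 1)."""
    g = np.random.default_rng(seed).standard_normal(n_samples)
    vals = sigma(g)
    return np.array([np.mean(vals * hermitenorm(k)(g)) / sqrt(factorial(k))
                     for k in range(k_max + 1)])

# For sigma(z) = z^2 this is approximately [1, 0, 1.414, 0, 0] up to Monte Carlo noise.
print(np.round(hermite_coeffs(lambda z: z ** 2), 2))
```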

To address this, a "landscape design" approach constructs a surrogate objective $G(B)$ with the following properties:

  • All local minima are global and recover the true weights (up to permutation and sign),
  • All saddle points have strict negative curvature,
  • $G(B)$ and its gradient are sample-estimable and smooth (Ge et al., 2017).

In the special orthogonal case, $G(B)$ is a sum of smooth polynomial terms involving the inner products between the candidate weights and the data, combined with norm-regularization constraints:

$$G(B) = \mathrm{sign}(\widehat{\sigma}_4)\, \mathbb{E}\Big[ y \sum_{j \neq k} \varphi(b_j, b_k, x) \Big] - \mu\, \mathrm{sign}(\widehat{\sigma}_4)\, \mathbb{E}\Big[ y \sum_{j} \phi(b_j, x) \Big] + \lambda \sum_{i=1}^m \big(\|b_i\|^2 - 1\big)^2.$$

On this landscape, stochastic gradient descent provably converges globally from generic initializations under polynomial sample complexity (Ge et al., 2017).

3. Learning Dynamics and Generalization

Gradient Descent Dynamics

Gradient descent and its variants are effective for optimizing both the standard and specially designed loss functions for quadratic networks. For landscape-designed objectives, SGD reliably escapes all saddle points and, due to the absence of spurious minima, recovers the ground truth parameters (Ge et al., 2017).

For quadratic networks with standard squared loss, the training risk landscape exhibits an "energy barrier" separating rank-deficient parameter regions from full-rank global minima. If the empirical risk is pushed below this barrier, gradient descent finds a full-rank minimizer aligned with the teacher (up to orthogonal transformations) (Gamarnik et al., 2019).

In the overparameterized regime ($m \gg d$), the probability of spurious minima vanishes provided the sample size $n$ exceeds a critical threshold that depends linearly on $d$ and the teacher width $m^*$:

$$n_c = d(m^* + 1) - \tfrac{1}{2} m^*(m^* + 1).$$

Above this threshold, gradient descent reliably recovers the planted solution (Mannelli et al., 2020).
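
The following is an illustrative teacher-student sketch in the spirit of this planted setting, using a quadratic activation $\sigma(z)=z^2$ and plain full-batch gradient descent. The dimensions, step size, and iteration budget are arbitrary choices for the sketch, not values taken from Mannelli et al. (2020), and may need tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_star, m = 10, 2, 20                        # input dim, teacher width, student width (m >> m*)
n_c = d * (m_star + 1) - m_star * (m_star + 1) // 2
n = 4 * n_c                                     # sample size comfortably above the critical threshold

W_star = rng.standard_normal((m_star, d))       # planted teacher weights
X = rng.standard_normal((n, d))
y = ((X @ W_star.T) ** 2).sum(axis=1)           # teacher output: sum_i (w_i* . x)^2

W = 0.1 * rng.standard_normal((m, d))           # small random student initialization
lr = 2e-4
for _ in range(20000):
    pre = X @ W.T                               # (n, m) pre-activations
    resid = (pre ** 2).sum(axis=1) - y          # residuals of the squared loss
    grad = (4.0 / n) * (resid[:, None] * pre).T @ X   # gradient of the mean squared residual w.r.t. W
    W -= lr * grad

# With quadratic activation the student only matters through M = W^T W,
# so recovery is measured up to orthogonal transformations of the rows of W.
recovery_err = np.linalg.norm(W.T @ W - W_star.T @ W_star)
print(f"n_c={n_c}, n={n}, train MSE={np.mean(resid**2):.3e}, ||W^T W - W*^T W*||_F={recovery_err:.3e}")
```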

Sample Complexity and Generalization Bounds

The generalization properties of one-hidden-layer quadratic networks have been characterized using uniform convergence (capacity) bounds. Unlike networks with ReLU or other nonsmooth activations, networks with quadratic activations (which are analytic) admit uniform convergence and generalization guarantees when only the spectral norm of the hidden-layer weight matrix is controlled. Specifically,

  • Spectral norm control suffices for quadratic activations;
  • Frobenius norm control is required for nonsmooth activations;
  • Sample complexity bounds become width-independent for quadratic networks due to this smoothness (Vardi et al., 2022).

Statistical Query Lower Bounds

For the class of functions in the harmonic expansion up to degree $k$, both the iteration and query complexity of gradient descent, and of any statistical query method, are lower bounded by $n^{\Omega(k)}$, confirming the near-optimality of gradient-based training in this setting (Vempala et al., 2018).

4. Interpretability, Structure, and Algorithm Design

Fuzzy Logic and Spectral Characterization

Quadratic neurons natively compute "fuzzy logic" operations: through their quadratic forms, they can implement nonlinear logic gates (e.g., XOR, fuzzy AND) within a single hidden unit (Fan et al., 2018). The structure of the implemented fuzzy operation is determined by the eigenvalue spectrum of the quadratic form's matrix: the number and signs of these eigenvalues yield a taxonomy of the fuzzy logic gates that a given network layer realizes. This provides a pathway to analyze and interpret network minima and structural properties in information-theoretic terms (entropy of "good minima", a compositional measure $M$ of architectural quality).
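
A small sketch of this eigenvalue-based reading of a quadratic neuron follows; the inertia-based classification is a simplified illustration of the taxonomy described above, and the tolerance is an arbitrary choice.

```python
import numpy as np

def quadric_type(Q, tol=1e-10):
    """Classify the decision surface x^T Q x + ... = 0 by the inertia (sign pattern) of Q."""
    eig = np.linalg.eigvalsh(0.5 * (Q + Q.T))        # eigenvalues of the symmetrized quadratic form
    pos, neg = int(np.sum(eig > tol)), int(np.sum(eig < -tol))
    if pos == 0 or neg == 0:
        return "definite/semidefinite (ellipsoid- or paraboloid-like boundary)"
    return "indefinite (hyperbolic boundary, e.g. XOR-style gate)"

print(quadric_type(np.array([[0.0, 0.5], [0.5, 0.0]])))   # the XOR neuron from Section 1: indefinite
print(quadric_type(np.eye(2)))                             # spherical boundary: definite
```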

Algorithmic Implications and Architecture

Efficient vectorization of forward and backward passes for quadratic layers is possible by exploiting symmetry and caching of matrix operations. Each forward computation requires the (possibly cached) evaluation of $x^{\top} Q x$ plus the linear and bias terms; gradients must be handled for both the linear and quadratic weights. The backward pass can be fully vectorized, achieving practical efficiency even in relatively large-scale settings (Noel et al., 2023).
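
A minimal sketch of the vectorization point: a batched forward pass with `einsum` and the corresponding manual gradients for $Q$, $w$, $b$, and the inputs, given an upstream gradient on the pre-activations. The layout and naming are this article's, not the implementation of Noel et al. (2023).

```python
import numpy as np

def quad_forward(X, Q, W, b):
    """Batched pre-activations H[n, i] = x_n^T Q_i x_n + w_i^T x_n + b_i.
    Shapes: X (N, d), Q (m, d, d), W (m, d), b (m,)  ->  H (N, m)."""
    return np.einsum("nj,ijk,nk->ni", X, Q, X) + X @ W.T + b

def quad_backward(X, Q, W, G):
    """Gradients given upstream G[n, i] = dL/dH[n, i]."""
    dQ = np.einsum("ni,nj,nk->ijk", G, X, X)             # dL/dQ_i = sum_n G[n,i] x_n x_n^T
    dW = G.T @ X                                         # dL/dw_i = sum_n G[n,i] x_n
    db = G.sum(axis=0)                                   # dL/db_i = sum_n G[n,i]
    Qs = Q + np.transpose(Q, (0, 2, 1))
    dX = np.einsum("ni,ijk,nk->nj", G, Qs, X) + G @ W    # dL/dx_n = sum_i G[n,i]((Q_i + Q_i^T)x_n + w_i)
    return dQ, dW, db, dX

# Quick shape check with random data.
rng = np.random.default_rng(0)
N, d, m = 8, 5, 3
X = rng.standard_normal((N, d))
Q, W, b = rng.standard_normal((m, d, d)), rng.standard_normal((m, d)), rng.standard_normal(m)
H = quad_forward(X, Q, W, b)
dQ, dW, db, dX = quad_backward(X, Q, W, np.ones_like(H))
print(H.shape, dQ.shape, dW.shape, dX.shape)   # (8, 3) (3, 5, 5) (3, 5) (8, 5)
```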

Quadratic logistic regression and classification models can be implemented with only $d(d+1)/2$ extra parameters per unit due to matrix symmetry, and quadratic neurons enable single-layer separation of any dataset with $\mathcal{C}$ bounded clusters using only $\mathcal{C}$ output units (Noel et al., 2023).

Convex Formulations

Quadratic neural networks permit a convex reformulation for certain loss functions and applications. By expressing the input-output mapping as a quadratic form in a lifted input space (the input augmented with a constant), learning the entire mapping reduces to constrained optimization over a positive semidefinite matrix (or the difference of two PSD matrices). This allows globally optimal solutions and architecture selection via decomposition of the learned matrix (Rodrigues et al., 2022).
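
A minimal sketch of the lifted convex formulation, assuming the mapping is modeled as $f(x) = z^{\top}(P - N)z$ with $z = [x; 1]$ and $P, N \succeq 0$. It uses CVXPY with a toy quadratic target and a trace penalty, and is a sketch of the general idea rather than the exact program of Rodrigues et al. (2022); solving it requires an SDP-capable solver such as SCS (installed with CVXPY by default).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, n = 3, 60
X = rng.standard_normal((n, d))
y = (X ** 2).sum(axis=1) - 1.0                       # toy target, itself a quadratic form

Z = np.hstack([X, np.ones((n, 1))])                  # lifted inputs z = [x; 1]
P = cp.Variable((d + 1, d + 1), PSD=True)            # positive semidefinite part
N = cp.Variable((d + 1, d + 1), PSD=True)            # positive semidefinite part (subtracted)
preds = cp.sum(cp.multiply(Z @ (P - N), Z), axis=1)  # z_i^T (P - N) z_i for each sample
loss = cp.sum_squares(preds - y) / n
reg = 1e-3 * (cp.trace(P) + cp.trace(N))             # trace penalty encourages a low-rank solution
cp.Problem(cp.Minimize(loss + reg)).solve()

# Architecture read-out: eigen-decomposition of the learned matrix yields the effective hidden units.
M = P.value - N.value
eigvals = np.linalg.eigvalsh(M)
print("significant eigenvalues:", eigvals[np.abs(eigvals) > 1e-3])
```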

5. Theoretical Extensions: Random Matrix and Large Deviation Analysis

Spectral Theory and Feature Propagation

The asymptotic spectrum of random feature matrices generated by applying smooth nonlinearities (e.g., quadratic) to random projections, $Y = f(WX)$, can be characterized using the resolvent method, culminating in quartic self-consistent equations for the Stieltjes transform of the empirical spectral distribution. For quadratic activations with zero bias, the spectrum can, in special cases, match the Marchenko–Pastur law, ensuring well-conditioned propagation of features. Any nontrivial additive bias destroys isospectrality; there is always a global spectral distortion owing to the bias, which cannot be remedied by any choice of nonlinearity (Piccolo et al., 2021).
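
A numerical sketch for inspecting the empirical spectral distribution of $\tfrac{1}{n} Y Y^{\top}$ with $Y = f(WX)$ and a centered, variance-normalized quadratic $f$ with zero bias. Whether the resulting spectrum actually matches the Marchenko–Pastur law depends on the precise scaling and bias conditions in Piccolo et al. (2021); this snippet only computes the eigenvalues one would histogram for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 400, 600, 1200                      # input dim, feature dim, sample count
X = rng.standard_normal((d, n))
W = rng.standard_normal((p, d)) / np.sqrt(d)  # rows of W have roughly unit norm
f = lambda z: (z ** 2 - 1) / np.sqrt(2)       # centered, unit-variance quadratic, zero bias
Y = f(W @ X)

eigvals = np.linalg.eigvalsh(Y @ Y.T / n)     # empirical spectrum of the feature Gram matrix
print("spectral range:", eigvals.min(), eigvals.max())
# np.histogram(eigvals, bins=60) gives the empirical spectral density for comparison with MP.
```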

Large Deviations and Training Dynamics

Recent work has established rigorous quenched and annealed large deviation principles for the empirical trajectories of SGD-trained weights in one-hidden-layer quadratic networks. By modeling neuron weights as a sequence of interacting particles, the analysis yields explicit rate functions (in terms of relative entropies) that characterize the probability of rare, atypical training behavior as the network width and training steps grow. These results formalize the link between SGD training and mean-field McKean–Vlasov dynamics, demonstrating that rare event probabilities decay exponentially with the network width and training time (Hirsch et al., 14 Mar 2024).

6. Practical Implications and Modern Applications

  • Supervised Learning: Quadratic networks achieve superior expressive efficiency, allowing them to separate clusters or model complex nonlinear boundaries with far fewer units than conventional networks, particularly benefiting tasks such as pattern recognition, digit classification, and function regression (Fan et al., 2018, Rodrigues et al., 2022).
  • Control and System Identification: The analytical, closed-form quadratic mapping provides immediate support for system identification and Lyapunov-based control. The output's Lipschitz constant can be directly computed, with convex training providing global optima and certified robustness (Rodrigues et al., 2022).
  • Generalization and Regularization: Width-independent generalization can be achieved via spectral or Frobenius norm control, with quadratic networks benefiting from the smoothness of their activations (Vardi et al., 2022).
  • Architecture Discovery: Decomposition of the trained quadratic form yields the effective network structure (number and direction of hidden units), automatically adapting architecture complexity to data.

7. Bayesian, Statistical, and Information-Theoretic Analyses

The generalization behavior, including bias and variance decomposition, of Bayesian-trained one-hidden-layer quadratic networks can be captured by an effective action framework. In the proportional limit (width and sample size large with fixed ratio), finite-width corrections are encoded by a global rescaling (kernel renormalization) of the infinite-width kernel. This mechanism quantitatively reproduces the empirical generalization curves as verified on benchmark datasets (Baglioni et al., 19 Jan 2024). Information-theoretic analyses of quadratic networks’ loss landscapes via entropy-based measures deepen understanding of architectural choices and optimization (Fan et al., 2018).


In conclusion, one-hidden-layer quadratic neural networks serve as a rigorously studied, highly expressive, and computationally tractable generalization of shallow architectures. They combine provable optimization properties, statistically robust generalization guarantees, efficient algorithmic implementations, and enhanced interpretability—a confluence underpinned by decades of mathematical analysis and subject to ongoing advances in convex geometry, random matrix theory, and statistical learning.
