Quadratic and Diagonal Neural Networks
- Quadratic and Diagonal Neural Networks are families of architectures that use quadratic feature interactions and diagonal parameterization to enhance expressivity, efficiency, and implicit regularization.
- Quadratic networks often utilize convex training protocols and stabilization techniques like ReLinear, while diagonal networks follow a lasso-like training path for sparse, interpretable solutions.
- These neural architectures have achieved competitive performance in tasks such as classification, system identification, and sequential modeling with reduced parameter counts and improved optimization dynamics.
Quadratic and diagonal neural networks represent two structured families of neural architectures and parameterizations that exploit either second-order interactions or strict coordinatewise decoupling to achieve improved expressivity, efficiency, or implicit regularization. Each class has instigated distinct lines of research across expressivity, optimization, training dynamics, and applications.
1. Architectural Principles and Mathematical Definitions
Quadratic Neural Networks (QNNs) employ neurons or layers whose output is a quadratic function of the input rather than a purely linear combination. The general pre-activation form is $z(x) = x^\top A x + b^\top x + c$, where $A$ can be a full, low-rank, or diagonal matrix. Notable instantiations include:
- Quadratic neurons: $y = \sigma\!\left(x^\top W x + w^\top x + b\right)$, or $y = \sigma\!\left((a^\top x)(c^\top x) + w^\top x + b\right)$ with $W \approx a c^\top$ (low-rank) (Xu et al., 2023).
- Decomposed quadratic neurons: $y = \sigma\!\big((w_1^\top x + b_1)(w_2^\top x + b_2) + w_3^\top (x \odot x) + b_3\big)$ (Xu et al., 2020).
- RQNN (Radial Quadratic Neural Network): $y = \sigma\!\left(w\,\lVert x - c\rVert^2 + b\right)$, where the isotropic quadratic term yields compact, circular level sets (Frischauf et al., 19 Jan 2024).
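These neuron forms are compact enough to write out directly. The NumPy sketch below is a minimal illustration of the three variants listed above; the symbol names (`W`, `a`, `c`, `w`, `center`, `scale`) and the `tanh` activation are placeholder choices rather than the exact notation or settings of the cited papers.

```python
import numpy as np

def quadratic_neuron(x, W, w, b, act=np.tanh):
    """Full quadratic neuron: act(x^T W x + w^T x + b)."""
    return act(x @ W @ x + w @ x + b)

def low_rank_quadratic_neuron(x, a, c, w, b, act=np.tanh):
    """Rank-one quadratic neuron: act((a^T x)(c^T x) + w^T x + b),
    i.e. the interaction matrix W is replaced by the outer product a c^T."""
    return act((a @ x) * (c @ x) + w @ x + b)

def radial_quadratic_neuron(x, center, scale, b, act=np.tanh):
    """Radial (RQNN-style) neuron: act(scale * ||x - center||^2 + b);
    its level sets are circles/spheres around `center`."""
    return act(scale * np.sum((x - center) ** 2) + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 4))
a, c, w, center = (rng.normal(size=4) for _ in range(4))
print(quadratic_neuron(x, W, w, 0.1))
print(low_rank_quadratic_neuron(x, a, c, w, 0.1))
print(radial_quadratic_neuron(x, center, 0.5, -1.0))
```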
Diagonal Neural Networks constrain the trainable weight matrices (or other maps) in a neural network to be diagonal, thereby enforcing coordinatewise independence in those connections or transformations. This appears in:
- Diagonal RNNs: the recurrent weights in $h_t = \sigma\!\left(W x_t + \operatorname{diag}(u)\, h_{t-1} + b\right)$ are diagonal, so the recurrence $u \odot h_{t-1}$ is elementwise (Subakan et al., 2017).
- Diagonal conceptors: reservoir state filtering via a diagonal matrix, scaling each neural state independently (Jong, 2021).
- Diagonal linear networks (DLNs): all hidden weights are diagonal, often resulting in the effective linear predictor $x \mapsto \langle u \odot v,\, x\rangle$, where "$\odot$" denotes the Hadamard product (Berthier, 23 Sep 2025, Berthier, 2022).
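For concreteness, a minimal sketch of the two diagonal parameterizations discussed above follows; the shapes and names (`u`, `v`, `W_in`, `u_rec`) are illustrative assumptions, not the exact formulations of the cited works.

```python
import numpy as np

def diagonal_linear_net(x, u, v):
    """Two-layer diagonal linear network: f(x) = <u * v, x>;
    the effective regression vector is beta = u * v (Hadamard product)."""
    return np.dot(u * v, x)

def diagonal_rnn_step(h_prev, x_t, W_in, u_rec, b, act=np.tanh):
    """Diagonal RNN update: the recurrent matrix is diag(u_rec),
    so the recurrence reduces to the elementwise product u_rec * h_prev."""
    return act(W_in @ x_t + u_rec * h_prev + b)

rng = np.random.default_rng(1)
d, n_hidden = 5, 3
x = rng.normal(size=d)
u, v = rng.normal(size=d), rng.normal(size=d)
print("DLN output:", diagonal_linear_net(x, u, v))

h = np.zeros(n_hidden)
W_in = rng.normal(size=(n_hidden, d))
u_rec, b = rng.normal(size=n_hidden), np.zeros(n_hidden)
print("diagonal RNN state:", diagonal_rnn_step(h, x, W_in, u_rec, b))
```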
2. Expressivity and Approximation Properties
Quadratic networks have been shown to be strictly more expressive than conventional (linear or affine) networks of similar size, often with significantly improved parametric efficiency:
- Spline Representation: QNNs can exactly represent any univariate polynomial spline, while standard ReLU networks cannot reproduce splines with nonzero high-order terms (Fan et al., 2021).
- Dimension of Function Spaces: Algebraic geometry analysis shows the dimension of function spaces representable by QNNs with quadratic neurons strictly exceeds that of conventional or quadratic-activated networks (Fan et al., 2021).
- Parametric Efficiency: There exist functions in the quadratic Barron space that are not in the conventional Barron space, meaning QNNs can approximate certain high-dimensional functions with dimension-free error, while linear networks require exponentially more parameters (Fan et al., 2023).
- Manifold and Sobolev Approximation: QNNs match or outperform standard networks in approximating functions on manifolds, with parameter savings scaling exponentially in dimension (Fan et al., 2023).
Diagonal networks, due to their strict decoupling, are less expressive but offer clear interpretation and benefits in scenarios where independence across inputs or state variables is a modeling advantage. However, in diagonal RNNs, the simplicity does not preclude competitive performance: elementwise recurrence can still capture sufficient temporal dependencies for music modeling tasks (Subakan et al., 2017).
3. Training Dynamics, Regularization, and Optimization
Quadratic Networks:
- Convex Training Protocols: In certain designs where the output is represented analytically as a quadratic form, training reduces to a convex optimization problem (e.g., minimizing a regularized quadratic loss), so global minima are attainable and the architecture is effectively discovered as a by-product of the optimization (Rodrigues et al., 2022, Asri et al., 2023).
- ReLinear Stabilization: The "ReLinear" approach regularizes gradient flow by shrinking the learning rates or magnitudes of the quadratic weights, addressing the instability caused by the polynomial degree growing exponentially with depth and aiding convergence (Fan et al., 2021); a schematic example follows this list.
- Natural Gradient and Block-Diagonal Scaling: Block-diagonal approximations to curvature matrices enable second-order optimization updates efficiently; diagonal rescaling recovers algorithms such as RMSProp or fan-in scaling as special cases (Lafond et al., 2017).
- Catapult Dynamics: Quadratic teacher–student models can undergo catapult phase transitions similar to those of deep linear networks; their convergence and instability analysis generalizes to this setting using tools from analytic geometry (e.g., the Łojasiewicz inequality) (Zhu et al., 2022).
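The core of the ReLinear idea, training the quadratic weights far more conservatively than the linear ones, can be illustrated schematically. The NumPy sketch below fits a small quadratic model by gradient descent with a separate, much smaller learning rate on the quadratic weights; the specific parameter split and the learning-rate ratio are assumptions for illustration, not the exact protocol of Fan et al. (2021).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * X[:, 0] * X[:, 1]   # mildly quadratic target

w = np.zeros(d)                        # linear weights
Wq = np.zeros((d, d))                  # quadratic weights, initialized at zero
lr_linear, lr_quadratic = 1e-2, 1e-4   # quadratic terms are trained far more slowly

def predict(X, w, Wq):
    return X @ w + np.einsum("ni,ij,nj->n", X, Wq, X)

for step in range(2000):
    r = predict(X, w, Wq) - y                          # residuals
    grad_w = X.T @ r / n                               # gradient for the linear weights
    grad_Wq = np.einsum("n,ni,nj->ij", r, X, X) / n    # gradient for the quadratic weights
    w -= lr_linear * grad_w
    Wq -= lr_quadratic * grad_Wq                       # shrunken step keeps quadratic growth tame

print("final mse:", float(np.mean((predict(X, w, Wq) - y) ** 2)))
```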
Diagonal Networks:
- Implicit Regularization and the Lasso Path: Gradient descent from small initialization in diagonal linear networks converges to the minimum $\ell_1$-norm (sparse) solution among all data-fitting minimizers (Berthier, 23 Sep 2025). The training-time trajectory traces the lasso regularization path: earlier stopping corresponds to larger regularization (more sparsity).
- Incremental Learning and Activation: In models like the DLN, gradient flow starts from “dead” coordinates and incrementally activates them. The support of the estimate grows, mirroring coordinate-by-coordinate entry into the optimal solution set—a dynamical view of the lasso regularization path (Berthier, 2022).
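This implicit bias is easy to reproduce numerically. The sketch below runs gradient descent on the diagonal parameterization $\beta = w_+ \odot w_+ - w_- \odot w_-$ from a small initialization on a noiseless, underdetermined problem with a sparse teacher; the initialization scale `alpha` plays the role of the (inverse) regularization strength. Problem sizes and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 60, 3                          # fewer samples than features
beta_star = np.zeros(d)
beta_star[:k] = [2.0, -1.5, 1.0]             # sparse teacher
X = rng.normal(size=(n, d)) / np.sqrt(n)
y = X @ beta_star

alpha = 1e-3                                 # small init -> strong implicit sparsity bias
w_plus = alpha * np.ones(d)
w_minus = alpha * np.ones(d)
lr = 0.01

for _ in range(100_000):
    beta = w_plus**2 - w_minus**2            # effective linear predictor
    g = X.T @ (X @ beta - y)                 # gradient with respect to beta
    # chain rule through the elementwise parameterization
    w_plus, w_minus = w_plus - 2 * lr * g * w_plus, w_minus + 2 * lr * g * w_minus

beta = w_plus**2 - w_minus**2
print("recovered support:", np.nonzero(np.abs(beta) > 0.1)[0])   # expected: [0 1 2]
print("coefficients there:", np.round(beta[:k], 3))
```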
| Network Type | Regularization Path | Special Property |
|---|---|---|
| Diagonal linear network (DLN) | Follows lasso path | Minimum $\ell_1$-norm bias |
| Quadratic NN | Problem dependent | Expressive, convex cases |
4. Practical Deployment, Efficient Implementation, and Hardware Considerations
Parameter and Memory Efficiency:
- Diagonalization and factorization of the quadratic interaction term reduce the per-neuron parameter count from $O(n^2)$ (dense interaction matrix) to $O(rn)$ (rank-$r$ factorization) or $O(n)$ (diagonal), without appreciable loss in fit or generalization (Xu et al., 2023, Xu et al., 2020); a parameter-count sketch follows this list.
- Libraries and frameworks such as QuadraLib provide drop-in quadratic layers, auto-conversion from first-order architectures, symbolic–automatic differentiation hybrid backpropagation, and memory-efficient activation management (Xu et al., 2022).
- Hardware-aware NAS (Neural Architecture Search), as in QuadraNet, adapts layer/channel configurations and kernel sizes to achieve throughput gains (~1.5×), ~30% lower memory, and no loss of cognition capability (accuracy) compared to transformer-based models (Xu et al., 2023).
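The parameter-count argument behind these savings can be made explicit with simple bookkeeping. The sketch below counts parameters of a hypothetical quadratic layer under full, low-rank, and diagonal factorizations of the interaction matrix (biases omitted); the counting conventions are assumptions for illustration, not the exact layer definitions of the cited libraries.

```python
def quadratic_layer_params(n_in, n_out, structure="full", rank=1):
    """Parameters of one quadratic layer for different factorizations of the
    n_in x n_in interaction matrix (illustrative counting; biases omitted)."""
    linear = n_in * n_out                       # ordinary first-order weights
    if structure == "full":
        quad = n_out * n_in * n_in              # a dense interaction matrix per output unit
    elif structure == "low_rank":
        quad = n_out * 2 * rank * n_in          # W ~ sum of `rank` outer products a_r b_r^T
    elif structure == "diagonal":
        quad = n_out * n_in                     # only x * x (Hadamard) self-interactions
    else:
        raise ValueError(structure)
    return linear + quad

for structure in ("full", "low_rank", "diagonal"):
    print(structure, quadratic_layer_params(512, 512, structure, rank=2))
```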
Transfer and Adaptation:
- Quadratic adaptation modules (QuadraNet V2) allow pre-trained linear networks to be “upgraded” via learnable quadratic terms for modeling nonlinear domain shifts, significantly reducing computational cost for adaptation (~90–98.6% GPU hour savings) (Xu et al., 6 May 2024).
Estimation of Diagonal and Curvature Terms:
- Efficient diagonal estimation via stochastic quadratic form queries enables approximate Hessian diagonal extraction using only function evaluations, critical for large-scale second-order optimization or diagonal preconditioning in neural training (Ye et al., 18 Jun 2025).
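The underlying principle, that curvature diagonals are accessible from function values alone, can be illustrated with a deterministic coordinate-probe variant: a second-order central difference along a direction $v$ approximates the quadratic form $v^\top H v$, and probing along coordinate directions recovers the Hessian diagonal. The cited work develops randomized query schemes that reduce the number of probes in high dimension; the sketch below only shows the basic mechanism on a toy quadratic objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
A = rng.normal(size=(d, d))
H_true = A @ A.T                       # ground-truth Hessian of the toy objective

def f(x):
    return 0.5 * x @ H_true @ x        # only f(.) is queried below

def quad_form(f, x, v, h=1e-3):
    """Estimate v^T H(x) v from function values via a second-order
    central difference along the direction v."""
    return (f(x + h * v) - 2.0 * f(x) + f(x - h * v)) / h**2

x0 = rng.normal(size=d)

# Coordinate probes: quadratic-form queries along e_i recover H_ii.
diag_est = np.array([quad_form(f, x0, e) for e in np.eye(d)])
print("max abs error:", np.max(np.abs(diag_est - np.diag(H_true))))
```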
5. Applications and Empirical Results
Classification and Regression Tasks:
- Shallow QNNs with quadratic (or compact/circular) decision functions achieve perfect accuracy on synthetic compact-cluster classification, outperforming affine networks with comparable depth, and are competitive or superior on MNIST digit separation (Frischauf et al., 19 Jan 2024).
- Quadratic networks deployed in image classification (CIFAR, ImageNet), point cloud segmentation (S3DIS), and bearing failure diagnosis show consistently improved or matched accuracy at reduced parameter and computational budget (Fan et al., 2023, Xu et al., 2022).
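The compact-cluster result is intuitive to reproduce on a toy problem: a single radial-quadratic neuron has a circular decision boundary and can therefore separate points inside a disk from points outside it, which no single affine neuron can do. The sketch below trains such a neuron with plain gradient descent on the logistic loss; the learning rate, iteration count, and parameterization are illustrative assumptions, not the setup of Frischauf et al.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-2, 2, size=(n, 2))
y = (np.sum(X**2, axis=1) < 1.0).astype(float)     # label: inside the unit circle

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single radial-quadratic neuron: sigmoid(a * ||x - c||^2 + b).
a, b = -1.0, 0.0
c = np.zeros(2)
lr = 0.2
for _ in range(10_000):
    diff = X - c
    r2 = np.sum(diff**2, axis=1)
    err = sigmoid(a * r2 + b) - y                   # d(logistic loss)/d(pre-activation)
    a -= lr * np.mean(err * r2)
    b -= lr * np.mean(err)
    c -= lr * np.mean((-2.0 * a * err)[:, None] * diff, axis=0)

pred = sigmoid(a * np.sum((X - c)**2, axis=1) + b) > 0.5
print("training accuracy:", (pred == y.astype(bool)).mean())    # expect near-perfect accuracy
```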
System Identification and Control:
- QNNs trained via convex optimization are used for system identification, regression, and controller synthesis (including Lyapunov stability guarantees); they remain robust even with limited training data and admit analytic sensitivity bounds (Rodrigues et al., 2022, Asri et al., 2023).
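The convexity claim can be illustrated in miniature: when the quadratic model is written as a linear combination of fixed monomial features, regularized least-squares training is convex with a closed-form global optimum. The sketch below identifies a toy quadratic system this way; the feature basis and regularization constant are illustrative assumptions, not the exact formulation of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
# Toy system: y is a quadratic function of the inputs plus a little noise.
y = 1.5 * X[:, 0]**2 - 0.8 * X[:, 0] * X[:, 1] + 0.3 * X[:, 1] + rng.normal(scale=0.01, size=n)

def quadratic_features(X):
    """Lift inputs to the monomial basis [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# The model is linear in these coefficients, so ridge-regularized training is a
# convex problem whose global optimum is available in closed form.
Phi = quadratic_features(X)
lam = 1e-6
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print(np.round(theta, 3))   # coefficients of [1, x1, x2, x1^2, x1*x2, x2^2]
```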
Sequential Modeling and Memory:
- Diagonal RNNs (diagonal recurrent matrices in VRNN, GRU, LSTM) yield better test likelihoods and convergence on music generation datasets, with fewer parameters and enhanced stability (Subakan et al., 2017).
- Diagonal conceptors for RNNs cut storage and training cost, with comparable performance to full-matrix conceptors in learning, recall, and pattern morphing on temporal datasets (Jong, 2021). Careful parameter tuning is required for stability.
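A minimal sketch of the diagonal-conceptor mechanism follows: a driven reservoir is run on a pattern, a per-neuron gain vector is computed from the state energies (the diagonal analogue of the conceptor formula $C = R(R + \alpha^{-2} I)^{-1}$), and the reservoir is re-run with that gain applied elementwise in the feedback loop. Reservoir size, weight scalings, and the aperture $\alpha$ are illustrative assumptions; the full recall and morphing procedures of (Jong, 2021) involve additional steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, alpha = 100, 8.0

W = rng.normal(size=(n_res, n_res)) * 0.1        # reservoir recurrent weights
W_in = rng.normal(size=n_res)                    # input weights for a 1-D signal
b = rng.normal(size=n_res) * 0.2

def drive(pattern, c=None, washout=50):
    """Run the reservoir on a 1-D pattern; if c is given, filter the state
    elementwise with the diagonal conceptor before feeding it back."""
    x = np.zeros(n_res)
    states = []
    for t, p in enumerate(pattern):
        x_fb = x if c is None else c * x         # diagonal conceptor: per-neuron gain
        x = np.tanh(W @ x_fb + W_in * p + b)
        if t >= washout:
            states.append(x)
    return np.array(states)

pattern = np.sin(2 * np.pi * np.arange(300) / 9.3)
X = drive(pattern)

# Per-neuron gains from signal energy: diagonal of the state correlation matrix R.
var = np.mean(X**2, axis=0)
c = var / (var + alpha**-2)

X_recalled = drive(pattern, c=c)                 # reservoir run with the conceptor in the loop
print("mean conceptor gain:", round(float(c.mean()), 3))
```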
High-Dimensional and Theoretical Scaling Laws:
- In high-dimensional teacher–student settings with quadratic activations, learning dynamics can be exactly characterized using matrix Riccati flows; risk decays with explicit power-law scaling that depends on the teacher's coefficient decay exponent (Arous et al., 5 Aug 2025).
6. Robustness, Generalization, and Theoretical Guarantees
- Function Identification Principle: QNNs allow identification of the target function (in the sense of equivalence classes of parameterizations realizing the same function) even where parameter identification is impossible, enabling robust generalization to arbitrary distribution shifts over the training domain (Xu et al., 2021).
- Uniform Generalization Bounds: For quadratic and certain ReLU networks, uniform bounds on the prediction error can be ensured over all distributions supported on the bounded input set, independent of the divergence from the training distribution (Xu et al., 2021).
- Implicit Bias and Early Stopping: In diagonal linear networks, early stopping is dynamically equivalent to strong lasso regularization: training time acts as an inverse regularization parameter (Berthier, 23 Sep 2025).
7. Limitations, Monotonicity, and Open Theoretical Aspects
- The exact equivalence between the gradient-flow trajectory in DLNs and the lasso path requires coordinate-wise monotonicity of the lasso solutions; when not satisfied, deviations are quantifiable and can be empirically significant (Berthier, 23 Sep 2025).
- Diagonal conceptors, while efficient, can be less stable than full-matrix conceptors; improvements require further investigation, particularly for complex tasks or when extrapolation between patterns is desired (Jong, 2021).
- QNN performance is highly sensitive to initialization and requires careful regularization (e.g., ReLinear techniques) to avoid collapse or learn degenerate polynomials (Fan et al., 2021).
- Although quadratic adaptation provides environmental and computational savings, the quadratic term introduces moderately increased overhead compared to diagonal or strictly linear adaptation; the power of diagonal adaptation is limited to linear regime data shifts (Xu et al., 6 May 2024).
Summary Table: Core Technical Contrasts
| Aspect | Quadratic Neural Networks | Diagonal Neural Networks |
|---|---|---|
| Weight Structure | Full, low-rank, or Hadamard products | Strictly diagonal (elementwise weighting) |
| Expressivity | High (captures feature interactions) | Low (coordinatewise, linear/decoupled) |
| Optimization | Often convex (QNN architectures) | Implicit regularization (lasso path) |
| Main Regularization Path | Problem dependent, sometimes analytic | Exactly traces lasso path (training time acts as inverse regularization) |
| Parametric Efficiency | Dimension-free for certain functions | Sparse solutions; reduced parameterization |
| Principal Limitations | Needs regularization for stability | Expressivity restricted, dependence on monotonicity for path matching |
Quadratic and diagonal neural networks thus exemplify two opposite directions in neural architecture design: quadratic networks maximally exploit inter-feature synergy for expressivity and efficient high-order modeling, whereas diagonal or diagonalized designs maximize simplicity and sparsity, enabling analytic regularization dynamics and practical scalability for selected applications. Recent advances highlight the value of both classes for specific use cases, elucidate their underlying training dynamics, and suggest that their combination or modular adaptation may yield further efficiency and robustness gains in large-scale and adaptive machine learning systems.