
Linear Convolutional Networks Overview

Updated 20 January 2026
  • Linear Convolutional Networks are defined as deep neural architectures composed exclusively of linear convolution operations, offering a framework for implicit regularization and rich geometric properties.
  • They leverage Fourier domain diagonalization to efficiently analyze spectral characteristics and induce frequency-domain biases distinct from fully connected networks.
  • Their structured design facilitates insights into complex optimization landscapes, Riemannian function-space geometry, and practical benefits like filter redundancy and improved generalization.

A linear convolutional network is a deep neural architecture in which each layer applies a convolutional operator (without bias or nonlinearity) to its input, yielding a composition of linear convolutions. Despite their functional equivalence to a single convolutional mapping, the structure and parameterization of linear convolutional networks (LCNs) induce rich implicit regularization, geometric, and algebraic phenomena that make them central objects in the analysis of deep learning theory, implicit bias, and optimization landscapes. Their study also provides key insight into the behavior of nonlinear convolutional networks in certain regimes and enables efficient algebraic manipulation, particularly in the analysis of spectral properties and function-space geometry.

1. Model Definition and Representations

A depth-$L$ linear convolutional network in the canonical 1D case comprises a sequence of layer-wise convolutions with learnable filters, no activations or biases, and (possibly multi-channel) inputs and outputs. For width-$D$, depth-$L$ full-width models, the standard time-domain recursion is

$$
\begin{aligned}
h_0(x) &= x, \\
h_{\ell}[d] &= (h_{\ell-1} \star u_\ell)[d] = \frac{1}{\sqrt{D}} \sum_{k=0}^{D-1} u_\ell[k]\, h_{\ell-1}[(d+k) \bmod D] \quad (1 \leq \ell < L), \\
f_u(x) &= h_{L-1}(x)^\top u_L,
\end{aligned}
$$

with $u_\ell \in \mathbb{R}^D$. The network is thus specified by a parameter tuple $u = (u_1, \ldots, u_L)$.
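The recursion above is straightforward to implement directly. The following is a minimal NumPy sketch of the full-width 1D model (circular layers with the $1/\sqrt{D}$ scaling, no bias or nonlinearity); all names are illustrative, not from the cited works.

```python
import numpy as np

def circ_layer(h, u):
    """One bias-free layer: h'[d] = (1/sqrt(D)) * sum_k u[k] * h[(d+k) mod D]."""
    D = len(h)
    # np.roll(h, -d)[k] == h[(k + d) mod D], so u @ np.roll(h, -d) is the sum above.
    return np.array([u @ np.roll(h, -d) for d in range(D)]) / np.sqrt(D)

def lcn_forward(x, filters):
    """Depth-L LCN: L-1 convolutional layers, then an inner product with u_L."""
    h = x
    for u in filters[:-1]:
        h = circ_layer(h, u)
    return h @ filters[-1]

rng = np.random.default_rng(0)
D, L = 8, 3
x = rng.standard_normal(D)
filters = [rng.standard_normal(D) for _ in range(L)]
y = lcn_forward(x, filters)  # a single scalar, linear in x
```

Since every layer is linear and bias-free, the whole map is linear in $x$, which the usage in the next paragraph exploits.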

A single-layer equivalent representation always exists:

$$
f_u(x) = x^\top w, \qquad w = P_{\mathrm{conv}}(u),
$$

where $P_{\mathrm{conv}}$ is a homogeneous polynomial map of degree $L$. This map admits a diagonalization in the Fourier domain: for the unitary DFT matrix $F$, the frequency coefficients satisfy $\widehat{w}[k] = \prod_{\ell=1}^L \widehat{u}_\ell[k]$ for all $k$, so $P_{\mathrm{conv}}$ acts as a coordinatewise product after the DFT (Gunasekar et al., 2018).
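This diagonalization can be checked numerically: building $w$ as the inverse unitary DFT of the coordinatewise product $\prod_\ell \widehat{u}_\ell$ reproduces the end-to-end map of the layered network. A small self-contained sketch (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, L = 8, 3
filters = [rng.standard_normal(D) for _ in range(L)]

# Unitary DFT coefficients of each filter, and their coordinatewise product.
w_hat = np.ones(D, dtype=complex)
for u in filters:
    w_hat *= np.fft.fft(u) / np.sqrt(D)
w = (np.sqrt(D) * np.fft.ifft(w_hat)).real   # invert the unitary DFT; w is real

def lcn_forward(x, filters):
    """Depth-L network: circular layers with 1/sqrt(D) scaling, then <h, u_L>."""
    h = x
    for u in filters[:-1]:
        h = np.array([u @ np.roll(h, -d) for d in range(D)]) / np.sqrt(D)
    return h @ filters[-1]

x = rng.standard_normal(D)
# The layered network agrees with the single linear map x -> x^T w.
assert np.isclose(lcn_forward(x, filters), x @ w)
```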

2. Inductive Bias and Frequency-Domain Regularization

Unlike fully connected linear networks, linear convolutional networks induce a distinct implicit regularization via gradient-based optimization. For gradient descent on separable data and exponential-type losses (e.g., logistic), the limit direction of the network weights is characterized by a constrained minimization:

| Model class | Limit direction minimizes | Regularizer (Fourier domain) |
|---|---|---|
| Fully connected (any depth) | $\min_w \lVert w \rVert_2^2$ subject to $y_n x_n^\top w \geq 1$ | $\ell_2$ norm ($p = 2$) |
| Convolutional, depth $L$ | $\min_w \lVert F w \rVert_{2/L}^{2/L}$ subject to $y_n x_n^\top w \geq 1$ | $\ell_{2/L}$ quasi-norm (bridge penalty) |

For $L = 2$, this is the $\ell_1$ norm in the Fourier domain (i.e., it promotes frequency sparsity). For $L > 2$, the penalty becomes a nonconvex quasi-norm, intensifying the frequency-sparsity bias as $L$ increases (Gunasekar et al., 2018). This effect has a major impact on generalization and on the spectral characteristics of learned filters.
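As a concrete illustration, the helper below (an assumed function, not from the cited work) evaluates $\lVert Fw \rVert_{2/L}^{2/L}$ and shows that a frequency-sparse predictor is cheaper than a frequency-dense one of the same $\ell_2$ norm:

```python
import numpy as np

def fourier_bridge_penalty(w, L):
    """||F w||_{2/L}^{2/L} with the unitary DFT: sum_k |w_hat[k]|^(2/L)."""
    w_hat = np.fft.fft(w) / np.sqrt(len(w))
    return np.sum(np.abs(w_hat) ** (2.0 / L))

D = 8
w_sparse = np.ones(D) / np.sqrt(D)  # all energy in a single frequency (k = 0)
w_dense = np.zeros(D)
w_dense[0] = 1.0                    # a delta: flat Fourier spectrum

# Both have unit l2 norm, but the frequency-sparse vector is much cheaper;
# the gap D^(1 - 1/L) grows with depth L.
p_sparse = fourier_bridge_penalty(w_sparse, L=2)  # = 1
p_dense = fourier_bridge_penalty(w_dense, L=2)    # = sqrt(D)
```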

In the multitask/multi-channel and multi-layer cases, the function-space regularizer is captured by nuclear or group norms on the Fourier coefficients, with closed forms for extremal kernel sizes (Jagadeesan et al., 2021). For two layers with a single input channel, the implicit $\ell_2$ inductive bias does not depend on the output width $C$; with multiple input channels, the width matters until $C \gg R$ (the number of input channels), beyond which the bias stabilizes.

3. Geometric and Algebraic Structure

3.1 Function Space Geometry

The set of functions realisable by an LCN with a given architecture (filter sizes $k_1, \ldots, k_L$, strides $s_1, \ldots, s_L$) forms a semi-algebraic subset of the space of linear maps from input to output (Kohn et al., 2021, Kohn et al., 2023, Shahverdi, 2024). The mapping from parameters to function space can be described through sparse, multihomogeneous polynomial factorizations:

$$
\pi_1(w) = \prod_{l=1}^{L} \pi_{S_l}(w_l),
$$

where $\pi_s(w)$ maps a filter vector to a homogeneous bivariate polynomial in $x^s, y^s$, and $S_l = \prod_{i<l} s_i$. Thus, the neuromanifold of a 1D LCN is a set of polynomials with a prescribed layered factorization structure.

  • The function space has dimension $\sum_l k_l - (L-1)$; this is lower than the dimension of the full ambient convolutional function space except in specific cases (all strides equal to 1, "filling" architectures).
  • The boundary and singular locus correspond to multiple factor coincidences or repeated roots; spurious critical points typically occur at these singularities.
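For all-stride-1 architectures the factorization picture has a simple concrete form: composing 1D filters is multiplication of their generating polynomials, so the end-to-end filter has exactly $\sum_l k_l - (L-1)$ coefficients, matching the dimension count above. A minimal check (filter sizes chosen arbitrarily):

```python
import numpy as np

# Stride-1 composition of 1D filters = multiplication of their generating
# polynomials; np.convolve implements exactly this coefficient product.
sizes = [3, 5, 2]                 # filter sizes k_1, k_2, k_3
rng = np.random.default_rng(2)
filters = [rng.standard_normal(k) for k in sizes]

w = filters[0]
for u in filters[1:]:
    w = np.convolve(w, u)         # compose two stride-1 convolutional layers

end_to_end_len = len(w)           # = sum(k_l) - (L - 1) = 3 + 5 + 2 - 2 = 8
```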

3.2 Optimization Landscape and Critical Points

The construction of LCNs as polynomial maps means the loss surface on parameter space is highly nonconvex, but for generic data and architectures with all strides greater than one, nonzero critical points of the squared error correspond to smooth interior points of the function space; in particular, isolated spurious critical points are avoided (Kohn et al., 2023). The total number of critical points (the Euclidean distance degree, or ED degree) equals that of the Segre variety with the same partition, which can far surpass the determinantal count for fully connected networks (Shahverdi, 2024).

Convergence guarantees for continuous-time gradient flow exist under mild conditions: solutions remain bounded and converge to critical points of the risk, even though the parameterization is nonconvex (Diederen et al., 13 Jan 2026).

4. Spectral Structure and Analysis

Each multi-channel or multi-dimensional convolutional layer can be represented as a block (doubly) Toeplitz matrix acting on the vectorized input. The spectral density matrix $F(\omega_1, \omega_2)$, derived by Fourier analysis of the filter kernels, exactly characterizes the asymptotic singular value distribution as $n \to \infty$ (an extended Szegő theorem). Practical approximation schemes (circular wrapping, quantile interpolation) closely estimate singular values and are computationally scalable (Yi, 2020). Tight spectral-norm upper bounds can be computed efficiently and serve as effective regularizers for improving generalization, e.g., in ResNets.
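In the single-channel circulant (circular-wrapping) case, this machinery is exact even in small dimensions: the singular values of the convolution matrix are just the DFT magnitudes of the kernel, so the spectral norm is their maximum and no SVD is needed. A sketch under that simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 16
u = rng.standard_normal(D)

# Circulant matrix of circular convolution with u: C[d, k] = u[(d - k) mod D].
C = np.array([[u[(d - k) % D] for k in range(D)] for d in range(D)])

# A circulant matrix is diagonalized by the DFT, so its singular values are
# exactly the magnitudes of the DFT of u.
sv = np.linalg.svd(C, compute_uv=False)
dft_mags = np.abs(np.fft.fft(u))
spectral_norm = dft_mags.max()    # equals the largest singular value of C
```

For genuine (non-wrapped) Toeplitz convolutions this becomes an approximation that improves as the dimension grows, which is the content of the Szegő-type results cited above.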

5. Implicit Bias Under Normalization and Training Dynamics

The introduction of normalization (e.g., batch normalization) modifies the implicit bias in significant ways:

  • For two-layer, single-filter LCNs with patchwise inputs, batch normalization induces a patch-uniform margin bias, causing the learned classifier to maximize the minimum margin across all patches, rather than on the aggregate input (Cao et al., 2023).
  • Convergence to the uniform margin is exponential in $\log^2 t$, much faster than in unnormalized training (which converges at $O(1/\log t)$).
  • Patch-uniform margin can significantly improve robustness and generalization in settings with spatial heterogeneity.
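The patch-uniform margin is easy to state concretely. For a single filter $w$ applied to length-$k$ patches of an input $x$ with label $y \in \{\pm 1\}$, normalized training drives up the minimum of the per-patch margins below, whereas unnormalized training targets the aggregate margin; the names and the contiguous-patch scheme here are illustrative assumptions:

```python
import numpy as np

def patch_margins(w, x, y, k):
    """Per-patch margins y * <w, x_p> over all contiguous length-k patches of x."""
    patches = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return y * (patches @ w)

rng = np.random.default_rng(4)
x, w, y, k = rng.standard_normal(12), rng.standard_normal(3), 1.0, 3

margins = patch_margins(w, x, y, k)
uniform_margin = margins.min()     # what batch-normalized training maximizes
aggregate_margin = margins.sum()   # what unnormalized training maximizes
```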

6. Linear Factorizations, Redundancy, and Practical Implementation

Many modern CNN architectures exploit filter redundancy—linear dependency among convolutional filters. The LinearConv paradigm formalizes this by learning a small orthonormal basis for each layer and representing all filters as linear combinations, enforced by a correlation regularizer (Kahatapitiya et al., 2019). Empirically, LinearConv achieves near-baseline accuracy with approximately 50% fewer parameters, preserving the computational complexity at inference time.
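The parameter saving comes from a low-rank factorization of the filter bank. The sketch below shows the arithmetic under assumed layer dimensions; note that LinearConv additionally enforces (approximate) orthonormality of the basis via a correlation regularizer, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(5)
n_filters, k = 64, 3 * 3 * 16   # 64 filters over 3x3 patches with 16 channels
rank = 16                       # basis size (an assumed hyperparameter)

# LinearConv-style factorization: a small basis B plus per-filter mixing
# coefficients A; the full filter bank is recovered as W = A @ B.
B = rng.standard_normal((rank, k))
A = rng.standard_normal((n_filters, rank))
W = A @ B                       # shape (64, 144), rank-deficient by construction

full_params = n_filters * k                      # 9216 for a standard layer
factored_params = rank * k + n_filters * rank    # 3328 for the factored layer
```

At inference time `W` can be materialized once, so the convolution itself costs the same as in the standard layer, consistent with the claim above.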

Even randomly chosen spatial filters, paired with a learned $1 \times 1$ mixing, can suffice to match (and sometimes surpass) the performance of fully learned convolutional kernels. This architecture shifts the bias by regularizing via frozen diversity and reducing spatial adaptability, and can even increase adversarial robustness (Gavrikov et al., 2023). In bottleneck and mobile-style designs, replacing spatial convolutions with random banks plus learned mixing often yields negligible performance loss.
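A minimal sketch of this random-spatial-filters scheme, assuming depthwise random filters and "valid" boundary handling (both assumptions of this sketch, not details from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(6)
C_in, C_out, H, W = 4, 8, 6, 6

# Frozen random 3x3 spatial filters (one per input channel, never trained).
random_filters = rng.standard_normal((C_in, 3, 3))
# Learned 1x1 mixing across channels: the only trainable parameters here.
mix = rng.standard_normal((C_out, C_in))

def forward(x):
    """x: (C_in, H, W). Depthwise random conv ('valid'), then 1x1 mixing."""
    spatial = np.stack([
        np.array([[(x[c, i:i + 3, j:j + 3] * random_filters[c]).sum()
                   for j in range(W - 2)]
                  for i in range(H - 2)])
        for c in range(C_in)
    ])                                             # (C_in, H-2, W-2)
    return np.einsum('oc,chw->ohw', mix, spatial)  # (C_out, H-2, W-2)

y = forward(rng.standard_normal((C_in, H, W)))
```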

7. Function-Space Geometry, Riemannian Metrics, and Gradient Flow

The geometry of gradient flow for LCNs can be completely described in function space, revealing a rich Riemannian structure. For $D \geq 2$ (or 1D with all strides $> 1$), the parameter-to-function map is generically a finite cover, and the natural Euclidean gradient flow in parameter space projects to a Riemannian gradient flow in function space, with the metric induced by the neural tangent kernel on fibers of fixed norm gap invariants (Achour et al., 8 Jul 2025). This geometric perspective enables global convergence statements, the study of the curvature of neuromanifolds, and a generalization of fully connected linear network theory.

