
Linear Convolutional Networks Overview

Updated 20 January 2026
  • Linear Convolutional Networks are defined as deep neural architectures composed exclusively of linear convolution operations, offering a framework for implicit regularization and rich geometric properties.
  • They leverage Fourier domain diagonalization to efficiently analyze spectral characteristics and induce frequency-domain biases distinct from fully connected networks.
  • Their structured design facilitates insights into complex optimization landscapes, Riemannian function-space geometry, and practical benefits like filter redundancy and improved generalization.

A linear convolutional network is a deep neural architecture in which each layer applies a convolutional operator (without bias or nonlinearity) to its input, yielding a composition of linear convolutions. Despite their functional equivalence to a single convolutional mapping, the structure and parameterization of linear convolutional networks (LCNs) induce rich implicit regularization, geometric, and algebraic phenomena that make them central objects in the analysis of deep learning theory, implicit bias, and optimization landscapes. Their study also provides key insight into the behavior of nonlinear convolutional networks in certain regimes and enables efficient algebraic manipulation, particularly in the analysis of spectral properties and function-space geometry.

1. Model Definition and Representations

A depth-$L$ linear convolutional network in the canonical 1D case comprises a sequence of layer-wise convolutions with learnable filters, no activations or biases, and (possibly multi-channel) inputs and outputs. For width-$D$, depth-$L$ full-width models, the standard time-domain recursion is

$$
\begin{aligned}
h_0(x) &= x, \\
h_{\ell}[d] &= (h_{\ell-1} \star u_\ell)[d] = \frac{1}{\sqrt{D}} \sum_{k=0}^{D-1} u_\ell[k]\, h_{\ell-1}[(d+k) \bmod D] \quad (1 \leq \ell < L), \\
f_u(x) &= h_{L-1}(x)^\top u_L,
\end{aligned}
$$

with $u_\ell \in \mathbb{R}^D$. The network is thus specified by a parameter tuple $u = (u_1, \ldots, u_L)$.
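The recursion above is straightforward to implement directly. The following is a minimal NumPy sketch of the full-width 1D model (circular layers with the $1/\sqrt{D}$ scaling, no bias or nonlinearity); all names are illustrative, not from the cited works.

```python
import numpy as np

def circ_layer(h, u):
    """One bias-free layer: h'[d] = (1/sqrt(D)) * sum_k u[k] * h[(d+k) mod D]."""
    D = len(h)
    # np.roll(h, -d)[k] == h[(k + d) mod D], so u @ np.roll(h, -d) is the sum above.
    return np.array([u @ np.roll(h, -d) for d in range(D)]) / np.sqrt(D)

def lcn_forward(x, filters):
    """Depth-L LCN: L-1 convolutional layers, then an inner product with u_L."""
    h = x
    for u in filters[:-1]:
        h = circ_layer(h, u)
    return h @ filters[-1]

rng = np.random.default_rng(0)
D, L = 8, 3
x = rng.standard_normal(D)
filters = [rng.standard_normal(D) for _ in range(L)]
y = lcn_forward(x, filters)  # a single scalar, linear in x
```

Since every layer is linear and bias-free, the whole map is linear in $x$, which the usage in the next paragraph exploits.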

A single-layer equivalent representation always exists:

$$
f_u(x) = x^\top w, \qquad w = P_{\mathrm{conv}}(u),
$$

where $P_{\mathrm{conv}}$ is a homogeneous polynomial map of degree $L$. This map admits a diagonalization in the Fourier domain: for the unitary DFT matrix $F$, the frequency coefficients satisfy $\widehat{w}[k] = \prod_{\ell=1}^L \widehat{u}_\ell[k]$ for all $k$, so $P_{\mathrm{conv}}$ acts as a coordinatewise product after the DFT (Gunasekar et al., 2018).
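This diagonalization can be checked numerically: building $w$ as the inverse unitary DFT of the coordinatewise product $\prod_\ell \widehat{u}_\ell$ reproduces the end-to-end map of the layered network. A small self-contained sketch (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, L = 8, 3
filters = [rng.standard_normal(D) for _ in range(L)]

# Unitary DFT coefficients of each filter, and their coordinatewise product.
w_hat = np.ones(D, dtype=complex)
for u in filters:
    w_hat *= np.fft.fft(u) / np.sqrt(D)
w = (np.sqrt(D) * np.fft.ifft(w_hat)).real   # invert the unitary DFT; w is real

def lcn_forward(x, filters):
    """Depth-L network: circular layers with 1/sqrt(D) scaling, then <h, u_L>."""
    h = x
    for u in filters[:-1]:
        h = np.array([u @ np.roll(h, -d) for d in range(D)]) / np.sqrt(D)
    return h @ filters[-1]

x = rng.standard_normal(D)
# The layered network agrees with the single linear map x -> x^T w.
assert np.isclose(lcn_forward(x, filters), x @ w)
```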

2. Inductive Bias and Frequency-Domain Regularization

Unlike fully connected linear networks, linear convolutional networks induce a distinct implicit regularization via gradient-based optimization. For gradient descent on separable data and exponential-type losses (e.g., logistic), the limit direction of the network weights is characterized by a constrained minimization:

| Model class | Limit direction minimizes | Regularizer (Fourier domain) |
|---|---|---|
| Fully connected (any depth) | $\min_w \lVert w \rVert_2^2$ subject to $y_n x_n^\top w \geq 1$ | $\ell_2$ norm ($p = 2$) |
| Convolutional, depth $L$ | $\min_w \lVert F w \rVert_{2/L}^{2/L}$ subject to $y_n x_n^\top w \geq 1$ | $\ell_{2/L}$ quasi-norm (bridge penalty) |

For $L = 2$, this is the $\ell_1$ norm in the Fourier domain (i.e., it promotes frequency sparsity). For $L > 2$, the penalty becomes a nonconvex quasi-norm, intensifying the frequency-sparsity bias as $L$ increases (Gunasekar et al., 2018). This effect has a major impact on generalization and on the spectral characteristics of learned filters.
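As a concrete illustration, the helper below (an assumed function, not from the cited work) evaluates $\lVert Fw \rVert_{2/L}^{2/L}$ and shows that a frequency-sparse predictor is cheaper than a frequency-dense one of the same $\ell_2$ norm:

```python
import numpy as np

def fourier_bridge_penalty(w, L):
    """||F w||_{2/L}^{2/L} with the unitary DFT: sum_k |w_hat[k]|^(2/L)."""
    w_hat = np.fft.fft(w) / np.sqrt(len(w))
    return np.sum(np.abs(w_hat) ** (2.0 / L))

D = 8
w_sparse = np.ones(D) / np.sqrt(D)  # all energy in a single frequency (k = 0)
w_dense = np.zeros(D)
w_dense[0] = 1.0                    # a delta: flat Fourier spectrum

# Both have unit l2 norm, but the frequency-sparse vector is much cheaper;
# the gap D^(1 - 1/L) grows with depth L.
p_sparse = fourier_bridge_penalty(w_sparse, L=2)  # = 1
p_dense = fourier_bridge_penalty(w_dense, L=2)    # = sqrt(D)
```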

In the multitask/multi-channel and multi-layer cases, the function-space regularizer is captured by nuclear or group norms on the Fourier coefficients, with closed forms for extremal kernel sizes (Jagadeesan et al., 2021). For two layers with a single input channel, the implicit $\ell_2$ inductive bias does not depend on the output width $C$; with multiple input channels, the width matters until $C \gg R$ (the number of input channels), beyond which the bias stabilizes.

3. Geometric and Algebraic Structure

3.1 Function Space Geometry

The set of functions realisable by an LCN with a given architecture (filter sizes $k_1, \ldots, k_L$, strides $s_1, \ldots, s_L$) forms a semi-algebraic subset of the space of linear maps from input to output (Kohn et al., 2021, Kohn et al., 2023, Shahverdi, 2024). The mapping from parameters to function space can be described through sparse, multihomogeneous polynomial factorizations:

$$
\pi_1(w) = \prod_{l=1}^{L} \pi_{S_l}(w_l),
$$

where $\pi_s(w)$ maps a filter vector to a homogeneous bivariate polynomial in $x^s, y^s$, and $S_l = \prod_{i<l} s_i$. Thus, the neuromanifold of a 1D LCN is a set of polynomials with a prescribed layered factorization structure.

  • The function space has dimension $\sum_l k_l - (L-1)$; this is lower than the dimension of the full ambient convolutional function space except in specific cases (all strides equal to 1, "filling" architectures).
  • The boundary and singular locus correspond to multiple factor coincidences or repeated roots; spurious critical points typically occur at these singularities.
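For all-stride-1 architectures the factorization picture has a simple concrete form: composing 1D filters is multiplication of their generating polynomials, so the end-to-end filter has exactly $\sum_l k_l - (L-1)$ coefficients, matching the dimension count above. A minimal check (filter sizes chosen arbitrarily):

```python
import numpy as np

# Stride-1 composition of 1D filters = multiplication of their generating
# polynomials; np.convolve implements exactly this coefficient product.
sizes = [3, 5, 2]                 # filter sizes k_1, k_2, k_3
rng = np.random.default_rng(2)
filters = [rng.standard_normal(k) for k in sizes]

w = filters[0]
for u in filters[1:]:
    w = np.convolve(w, u)         # compose two stride-1 convolutional layers

end_to_end_len = len(w)           # = sum(k_l) - (L - 1) = 3 + 5 + 2 - 2 = 8
```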

3.2 Optimization Landscape and Critical Points

The construction of LCNs as polynomial maps means the loss surface on parameter space is highly nonconvex, but for generic data and architectures with all strides greater than one, nonzero critical points of the squared error correspond to smooth interior points of the function space; in particular, isolated spurious critical points are avoided (Kohn et al., 2023). The total number of critical points (the Euclidean distance degree, or ED degree) equals that of the Segre variety with the same partition, which can far surpass the determinantal count for fully connected networks (Shahverdi, 2024).

Convergence guarantees for continuous-time gradient flow exist under mild conditions: solutions remain bounded and converge to critical points of the risk, even though the parameterization is nonconvex (Diederen et al., 13 Jan 2026).

4. Spectral Structure and Analysis

Each multi-channel or multi-dimensional convolutional layer can be represented as a block (doubly) Toeplitz matrix acting on the vectorized input. The spectral density matrix $F(\omega_1, \omega_2)$, derived by Fourier analysis of the filter kernels, exactly characterizes the asymptotic singular value distribution as $n \to \infty$ (an extended Szegő theorem). Practical approximation schemes (circular wrapping, quantile interpolation) closely estimate singular values and are computationally scalable (Yi, 2020). Tight spectral-norm upper bounds can be computed efficiently and serve as effective regularizers for improving generalization, e.g., in ResNets.
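In the single-channel circulant (circular-wrapping) case, this machinery is exact even in small dimensions: the singular values of the convolution matrix are just the DFT magnitudes of the kernel, so the spectral norm is their maximum and no SVD is needed. A sketch under that simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 16
u = rng.standard_normal(D)

# Circulant matrix of circular convolution with u: C[d, k] = u[(d - k) mod D].
C = np.array([[u[(d - k) % D] for k in range(D)] for d in range(D)])

# A circulant matrix is diagonalized by the DFT, so its singular values are
# exactly the magnitudes of the DFT of u.
sv = np.linalg.svd(C, compute_uv=False)
dft_mags = np.abs(np.fft.fft(u))
spectral_norm = dft_mags.max()    # equals the largest singular value of C
```

For genuine (non-wrapped) Toeplitz convolutions this becomes an approximation that improves as the dimension grows, which is the content of the Szegő-type results cited above.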

5. Implicit Bias Under Normalization and Training Dynamics

The introduction of normalization (e.g., batch normalization) modifies the implicit bias in significant ways:

  • For two-layer, single-filter LCNs with patchwise inputs, batch normalization induces a patch-uniform margin bias, causing the learned classifier to maximize the minimum margin across all patches, rather than on the aggregate input (Cao et al., 2023).
  • Convergence to the uniform margin is exponential in $\log^2 t$, much faster than in unnormalized training (which converges at $O(1/\log t)$).
  • Patch-uniform margin can significantly improve robustness and generalization in settings with spatial heterogeneity.
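The patch-uniform margin is easy to state concretely. For a single filter $w$ applied to length-$k$ patches of an input $x$ with label $y \in \{\pm 1\}$, normalized training drives up the minimum of the per-patch margins below, whereas unnormalized training targets the aggregate margin; the names and the contiguous-patch scheme here are illustrative assumptions:

```python
import numpy as np

def patch_margins(w, x, y, k):
    """Per-patch margins y * <w, x_p> over all contiguous length-k patches of x."""
    patches = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return y * (patches @ w)

rng = np.random.default_rng(4)
x, w, y, k = rng.standard_normal(12), rng.standard_normal(3), 1.0, 3

margins = patch_margins(w, x, y, k)
uniform_margin = margins.min()     # what batch-normalized training maximizes
aggregate_margin = margins.sum()   # what unnormalized training maximizes
```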

6. Linear Factorizations, Redundancy, and Practical Implementation

Many modern CNN architectures exploit filter redundancy—linear dependency among convolutional filters. The LinearConv paradigm formalizes this by learning a small orthonormal basis for each layer and representing all filters as linear combinations, enforced by a correlation regularizer (Kahatapitiya et al., 2019). Empirically, LinearConv achieves near-baseline accuracy with approximately 50% fewer parameters, preserving the computational complexity at inference time.
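The parameter saving comes from a low-rank factorization of the filter bank. The sketch below shows the arithmetic under assumed layer dimensions; note that LinearConv additionally enforces (approximate) orthonormality of the basis via a correlation regularizer, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(5)
n_filters, k = 64, 3 * 3 * 16   # 64 filters over 3x3 patches with 16 channels
rank = 16                       # basis size (an assumed hyperparameter)

# LinearConv-style factorization: a small basis B plus per-filter mixing
# coefficients A; the full filter bank is recovered as W = A @ B.
B = rng.standard_normal((rank, k))
A = rng.standard_normal((n_filters, rank))
W = A @ B                       # shape (64, 144), rank-deficient by construction

full_params = n_filters * k                      # 9216 for a standard layer
factored_params = rank * k + n_filters * rank    # 3328 for the factored layer
```

At inference time `W` can be materialized once, so the convolution itself costs the same as in the standard layer, consistent with the claim above.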

Even randomly chosen spatial filters, paired with a learned $1 \times 1$ mixing, can suffice to match (and sometimes surpass) the performance of fully learned convolutional kernels. This architecture shifts the bias by regularizing via frozen diversity and reducing spatial adaptability, and can even increase adversarial robustness (Gavrikov et al., 2023). In bottleneck and mobile-style designs, replacing spatial convolutions with random banks plus learned mixing often yields negligible performance loss.
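A minimal sketch of this random-spatial-filters scheme, assuming depthwise random filters and "valid" boundary handling (both assumptions of this sketch, not details from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(6)
C_in, C_out, H, W = 4, 8, 6, 6

# Frozen random 3x3 spatial filters (one per input channel, never trained).
random_filters = rng.standard_normal((C_in, 3, 3))
# Learned 1x1 mixing across channels: the only trainable parameters here.
mix = rng.standard_normal((C_out, C_in))

def forward(x):
    """x: (C_in, H, W). Depthwise random conv ('valid'), then 1x1 mixing."""
    spatial = np.stack([
        np.array([[(x[c, i:i + 3, j:j + 3] * random_filters[c]).sum()
                   for j in range(W - 2)]
                  for i in range(H - 2)])
        for c in range(C_in)
    ])                                             # (C_in, H-2, W-2)
    return np.einsum('oc,chw->ohw', mix, spatial)  # (C_out, H-2, W-2)

y = forward(rng.standard_normal((C_in, H, W)))
```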

7. Function-Space Geometry, Riemannian Metrics, and Gradient Flow

The geometry of gradient flow for LCNs can be completely described in function space, revealing a rich Riemannian structure. For $D \geq 2$ (or 1D with all strides $> 1$), the parameter-to-function map is generically a finite cover, and the natural Euclidean gradient flow in parameter space projects to a Riemannian gradient flow in function space, with the metric induced by the neural tangent kernel on fibers of fixed norm gap invariants (Achour et al., 8 Jul 2025). This geometric perspective enables global convergence statements, the study of the curvature of neuromanifolds, and a generalization of fully connected linear network theory.

