Layerwise Linear Models
- Layerwise linear models are a framework that treats each neural network layer as a linear transformation to enable tractable analysis and efficient training.
- They leverage techniques such as network linearization, neural tangent kernels, and layer-specific aggregation to reveal synchronized layer dynamics and improved generalization.
- Applications include scalable training, robust federated learning, and model compression, though they face limitations in representing high-dimensional non-polynomial functions.
Layerwise linear models are a foundational framework in modern machine learning for analyzing, designing, or optimizing neural networks by leveraging linearity at the level of individual layers, often in service of tractable inference, robust aggregation, efficient representation, interpretability, or theoretical analysis. These models appear in diverse forms, from explicit linearization of complex networks to practical methodologies that enforce or exploit layerwise structure in both training and aggregation. The framework encompasses both theoretical constructions, such as linearized neural network (NN) dynamics and kernel approximations, and practical algorithms for scalable training, federated learning, and model compression.
1. Conceptual Foundations and Definitions
A layerwise linear model is any modeling scenario or analysis in which each layer of a deep neural network is treated as a linear transformation, either exactly (as in linear networks or local linearization) or approximately (via first-order expansion, as in neural tangent kernel theory). Let a neural network be expressed as the composition $f(x) = f_L \circ f_{L-1} \circ \cdots \circ f_1(x)$; the layerwise linear perspective either takes each layer map $f_\ell$ to be linear or studies its linearization at the current parameters. In practice, this formalism surfaces in:
- Multi-layer linear neural networks (LNNs), in which every layer map $f_\ell$ is linear (Basu et al., 2019).
- Linearized models: neural networks expanded to first order in parameters, resulting in a sum over layerwise linear contributions (Misiakiewicz et al., 2023).
- Explicitly layerwise-trained or analyzed models, such as greedy layerwise learning or Bayesian layerwise inference (Belilovsky et al., 2018, Kurle et al., 18 Nov 2024).
- Federated learning algorithms performing per-layer aggregation for robust model fusion (García-Márquez et al., 27 Mar 2025).
Layerwise linearity provides a tractable interface between deep learning's expressive power and linear models' analytic clarity.
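As a concrete illustration of these two senses of layerwise linearity, the following NumPy sketch composes strictly linear layers into a single matrix and forms the first-order (parameter-space) linearization of one nonlinear layer; the layer sizes and the tanh nonlinearity are illustrative assumptions rather than details from the cited works.

```python
# A minimal sketch (NumPy) of the two senses of "layerwise linear" used above:
# (1) a strictly linear network, whose layers compose into a single matrix, and
# (2) a first-order linearization of a nonlinear layer in its parameters.
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 16)), rng.normal(size=(8, 8)), rng.normal(size=(4, 8))]
x = rng.normal(size=16)

# (1) Exactly linear: f(x) = W_3 W_2 W_1 x collapses to one matrix.
f_x = Ws[2] @ (Ws[1] @ (Ws[0] @ x))
W_total = Ws[2] @ Ws[1] @ Ws[0]
assert np.allclose(f_x, W_total @ x)

# (2) Approximately linear in the parameters: for a nonlinear layer
# g(W, x) = tanh(W x), the expansion around W0 is
# g(W0 + dW, x) ~ g(W0, x) + J(W0, x)[dW], which is linear in dW.
W0, dW = Ws[0], 1e-3 * rng.normal(size=Ws[0].shape)
g0 = np.tanh(W0 @ x)
jvp = (1.0 - g0 ** 2) * (dW @ x)          # Jacobian-vector product of tanh(Wx) w.r.t. W
print(np.linalg.norm(np.tanh((W0 + dW) @ x) - (g0 + jvp)))  # small: O(||dW||^2)
```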
2. Theoretical Analysis of Layerwise Linear Models
Linear Neural Networks and Layer Dynamics
In strictly linear settings, with $f_\ell(x) = W_\ell x$, the composite map is a product of weight matrices, yielding $f(x) = W_L W_{L-1} \cdots W_1 x$. Analysis has revealed a remarkable symmetry: under gradient descent, the Frobenius norms of all layer matrices $W_\ell$ grow approximately identically throughout training, with the differences between layer norms strictly bounded and, for orthogonal or Glorot initialization, exactly equal (Basu et al., 2019). This symmetry induces synchronized learning phases, precludes bottleneck effects, and accentuates depth-driven acceleration: layer norms exhibit a protracted phase of slow growth followed by a rapid, concerted increase as the network approaches the loss minimum.
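The following NumPy sketch illustrates this balanced-norm behavior on a small three-layer linear network fitted to a random linear teacher by gradient descent; the layer sizes, initialization scale, and step size are illustrative assumptions.

```python
# Gradient descent on a 3-layer linear network f(x) = W3 W2 W1 x fitted to a
# random linear teacher T. The point is that the per-layer Frobenius norms
# grow nearly in lockstep, as described above.
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 10, 10, 10
T = rng.normal(size=(k, d))                       # teacher matrix
scale = 0.1
W1, W2, W3 = (scale * rng.normal(size=s) for s in [(h, d), (h, h), (k, h)])

lr = 1e-2
for step in range(2001):
    E = W3 @ W2 @ W1 - T                          # residual of the end-to-end map
    g1 = W2.T @ W3.T @ E                          # dL/dW1 for L = 0.5 * ||E||_F^2
    g2 = W3.T @ E @ W1.T
    g3 = E @ W1.T @ W2.T
    W1, W2, W3 = W1 - lr * g1, W2 - lr * g2, W3 - lr * g3
    if step % 500 == 0:
        norms = [np.linalg.norm(W) for W in (W1, W2, W3)]
        print(step, " ".join(f"{n:.3f}" for n in norms))  # norms move together
```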
In practical neural architectures, nonlinearities (e.g., ReLU) break this global symmetry. However, piecewise-linear localizations of deep ReLU nets allow for transient recovery of linear layerwise dynamics, particularly observable in upper layers as training progresses and representations within classes align.
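A minimal sketch of such a piecewise-linear localization, assuming a small bias-free ReLU network: within the activation region of a given input, the network coincides exactly with a product of masked weight matrices.

```python
# Within the activation region of an input x, a bias-free ReLU network acts as
# an exact product of masked weight matrices. Sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(32, 16)), rng.normal(size=(4, 32))
x = rng.normal(size=16)

h = W1 @ x
mask = (h > 0).astype(float)                 # which ReLU units are active at x
f_x = W2 @ (mask * h)                        # forward pass through ReLU

# Local linear map valid on x's activation region: W2 diag(mask) W1.
W_local = W2 @ (mask[:, None] * W1)
assert np.allclose(f_x, W_local @ x)

# Small perturbations that do not flip any unit stay in the same linear region.
eps = 1e-6 * rng.normal(size=16)
print(np.linalg.norm(W2 @ np.maximum(W1 @ (x + eps), 0.0) - W_local @ (x + eps)))  # ~0
```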
Linearization and the Lazy Regime
In overparameterized settings and under certain training scalings, deep NNs operate in the "lazy" regime, where parameter updates remain close to initialization (Misiakiewicz et al., 2023). Linearizing the network about the initial parameters $\theta_0$, i.e. $f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0)$, enables the analysis of layerwise contributions via fixed feature maps (random features, neural tangent features). This linearization underpins theoretical advances in understanding double descent, benign overfitting, and generalization, and connects model outputs closely to classical kernel ridge regression and high-dimensional linear regression.
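Before the summary table below, a minimal NumPy sketch of this linearized view: a one-hidden-layer network is replaced by ridge regression on its fixed neural tangent features $\nabla_\theta f(x;\theta_0)$. The architecture, data, and ridge penalty are illustrative assumptions.

```python
# Linearized ("lazy") view of f(x) = a^T tanh(W x): regression on the fixed
# tangent features phi(x) = grad_theta f(x; theta_0) at initialization.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 64, 200
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)     # toy regression target

W0 = rng.normal(size=(m, d)) / np.sqrt(d)          # initialization theta_0 = (W0, a0)
a0 = rng.normal(size=m) / np.sqrt(m)

def ntk_features(X):
    """phi(x) = d f / d theta at theta_0, flattened per example."""
    H = np.tanh(X @ W0.T)                          # (n, m) hidden activations
    dW = (a0 * (1.0 - H ** 2))[:, :, None] * X[:, None, :]   # d f / d W, shape (n, m, d)
    return np.concatenate([dW.reshape(len(X), -1), H], axis=1)

Phi = ntk_features(X)
f0 = np.tanh(X @ W0.T) @ a0                        # network output at initialization
lam = 1e-3
# Ridge regression on the tangent features (equivalent to kernel ridge
# regression with the empirical neural tangent kernel).
delta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ (y - f0))
preds = f0 + Phi @ delta                           # linearized-model predictions
print("train MSE:", np.mean((preds - y) ** 2))
```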
| Model Type | Features Trained | Feature Matrix | Limiting Behavior |
|---|---|---|---|
| Linear regression | None | Covariates | Universal risk formulas |
| Kernel ridge regression | Infinite features | Kernel matrix | Staircase test error, overfitting |
| Random feature model | Second layer | Random-feature map | Double descent; matches KRR as the number of random features grows |
| Neural tangent model | First layer (linearized) | Neural tangent feature map | Equivalent to KRR with the NTK in the infinite-width limit |
The critical limitation is that linearized models, operating in fixed feature spaces, cannot adaptively learn new features beyond their span; consequently, certain function classes, such as non-polynomial ridge functions, cannot be learned by these models with polynomial sample complexity.
3. Methodologies and Algorithms
Greedy and Bayesian Layerwise Approaches
Layerwise linearity is exploited operationally in several training protocols:
- Greedy Layerwise Learning: Each layer of a deep neural network (typically a CNN) is trained independently to optimize an auxiliary classification objective, often as a 1-hidden-layer model (Belilovsky et al., 2018). Layers are sequentially stacked, with each subsequent model building upon the frozen representations of its predecessors. This approach achieves competitive accuracy on large-scale datasets (e.g., ImageNet), enforces an explicit per-layer progression in linear separability, requires less memory during training, and produces more interpretable representations than standard end-to-end training (a simplified sketch follows this list).
- Bayesian Layerwise Inference (BALI): Deep networks are treated as a stack of Bayesian linear regression models, one per layer (Kurle et al., 18 Nov 2024). For each layer, pseudo-targets are derived from the layer's outputs, updated using gradients obtained from backpropagation. The exact matrix-normal posterior (with Kronecker-factored covariance) is then computed per layer, using exponentially decayed sufficient statistics in the mini-batch regime. This method preserves uncertainty quantification and local convexity per layer, provides efficient posterior estimation and robustness, and breaks the problematic permutation symmetries inherent to global Bayesian approaches in deep NNs (the core per-layer update is sketched after this list).
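A simplified, NumPy-only sketch of the greedy layerwise protocol referenced in the first bullet: each ReLU layer is trained jointly with a throwaway linear auxiliary head on top of the frozen features of the preceding stages and then frozen itself. The MSE-to-one-hot objective, layer widths, and step counts are illustrative assumptions; the cited work trains CNN blocks with cross-entropy auxiliary classifiers at ImageNet scale.

```python
# Greedy layerwise training: train one layer plus an auxiliary linear head,
# freeze the layer, discard the head, and stack the next stage on top.
import numpy as np

rng = np.random.default_rng(0)
n, d_in = 512, 20
X = rng.normal(size=(n, d_in))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy, learnable 2-class problem
Y = np.eye(2)[labels]                              # one-hot targets

def train_one_layer(H_prev, Y, width, steps=500, lr=3e-2):
    """Train layer weights W jointly with an auxiliary linear head V; return frozen W."""
    W = rng.normal(size=(width, H_prev.shape[1])) / np.sqrt(H_prev.shape[1])
    V = np.zeros((Y.shape[1], width))
    for _ in range(steps):
        Z = H_prev @ W.T
        H = np.maximum(Z, 0.0)                     # this layer's ReLU features
        E = H @ V.T - Y                            # auxiliary-head residual (MSE loss)
        dV = E.T @ H / len(H)
        dZ = (E @ V) * (Z > 0) / len(H)
        dW = dZ.T @ H_prev
        V -= lr * dV
        W -= lr * dW
    aux_loss = float(np.mean((np.maximum(H_prev @ W.T, 0.0) @ V.T - Y) ** 2))
    return W, aux_loss                             # the auxiliary head V is discarded

H, layers = X, []
for width in (64, 64, 64):                         # stack layers one greedy stage at a time
    W, aux_loss = train_one_layer(H, Y, width)
    layers.append(W)
    H = np.maximum(H @ W.T, 0.0)                   # frozen representation for the next stage
    print(f"stage {len(layers)}: auxiliary loss {aux_loss:.3f}")
```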
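For the Bayesian layerwise view in the second bullet, the sketch below shows only the core per-layer update: a ridge-style posterior-mean solve for one linear layer given its inputs and gradient-derived pseudo-targets. The pseudo-target construction, the matrix-normal (Kronecker-factored) posterior, and the exponentially decayed statistics of BALI are simplified away here and should be treated as assumptions.

```python
# Heavily simplified per-layer Bayesian linear-regression update for a layer
# Z = Phi W^T: pseudo-targets are the layer's outputs nudged against their
# backpropagated gradient, and only the posterior mean under an isotropic
# Gaussian prior is computed.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 256, 16, 8
Phi = rng.normal(size=(n, d_in))                   # inputs to this layer
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in) # current layer weights
Z = Phi @ W.T                                      # layer outputs (pre-activation)
dL_dZ = rng.normal(size=(n, d_out))                # stand-in for backprop gradients

eta, lam = 0.1, 1.0
T = Z - eta * dL_dZ                                # pseudo-targets for this layer

# Posterior mean of W given (Phi, T): a ridge-regression solve built from the
# layer's own sufficient statistics Phi^T Phi and Phi^T T.
A = Phi.T @ Phi + lam * np.eye(d_in)               # input sufficient statistics
B = Phi.T @ T                                      # input-target cross statistics
W_post = np.linalg.solve(A, B).T                   # posterior-mean weights, (d_out, d_in)
print("update norm:", np.linalg.norm(W_post - W))
```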
Layerwise Aggregation and Robustness in Distributed Learning
In federated learning, robust global aggregation schemes such as Krum, Bulyan, or GeoMed are often compromised in high-dimensional parameter spaces. Recent advances propose layerwise aggregation (e.g., Layerwise Cosine Aggregation), where each layer's updates from clients are aggregated independently, leveraging cosine distance and median gradient clipping for norm invariance and attack resilience (García-Márquez et al., 27 Mar 2025). Theoretical analysis confirms that these modifications preserve or improve $(\alpha, f)$-Byzantine resilience, tightening robustness bounds by reducing the effective dimensionality of the aggregation space from the total parameter dimension to the maximal per-layer width. Empirically, the approach achieves up to a 16% accuracy improvement over conventional robust aggregation in adversarial scenarios.
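The sketch below shows one plausible instantiation of per-layer robust aggregation that combines median-norm clipping with cosine-distance scoring; the precise scoring and selection rules of Layerwise Cosine Aggregation in the cited work may differ.

```python
# Per-layer robust aggregation: clip each client's layer update to the median
# norm, score clients by summed cosine distance, and average the best-scoring
# updates, layer by layer.
import numpy as np

def aggregate_layer(updates, n_select=3):
    """updates: (n_clients, layer_dim) flattened per-layer client updates."""
    norms = np.linalg.norm(updates, axis=1)
    clip = np.median(norms)                                   # median norm clipping
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    U = updates * scale[:, None]
    directions = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    cos_dist = 1.0 - directions @ directions.T                # pairwise cosine distance
    scores = cos_dist.sum(axis=1)                             # low score = close to the crowd
    keep = np.argsort(scores)[:n_select]
    return U[keep].mean(axis=0)

def aggregate_model(client_models):
    """client_models: list of dicts {layer_name: np.ndarray}; aggregate each layer independently."""
    out = {}
    for name in client_models[0]:
        stacked = np.stack([m[name].ravel() for m in client_models])
        out[name] = aggregate_layer(stacked).reshape(client_models[0][name].shape)
    return out

# Toy usage: 8 honest clients plus 2 clients sending wildly scaled updates.
rng = np.random.default_rng(0)
shapes = {"conv1.weight": (16, 3, 3, 3), "fc.weight": (10, 64)}
honest = [{k: 0.01 * rng.normal(size=s) for k, s in shapes.items()} for _ in range(8)]
byzantine = [{k: 10.0 * rng.normal(size=s) for k, s in shapes.items()} for _ in range(2)]
agg = aggregate_model(honest + byzantine)
print({k: float(np.linalg.norm(v)) for k, v in agg.items()})  # stays at honest scale
```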
4. Implicit Bias, Regularization, and Model Selection
Adding linear layers to ReLU networks, particularly with weight decay, alters the representation cost associated with interpolating functions, biasing the network toward functions with low mixed variation, i.e., those that depend primarily on a low-dimensional projection of the input ("single- or multi-index models") (Parkinson et al., 2023). Mathematically, the implicit penalty induced by multiple linear layers with weight decay converges to minimization of a Schatten-$p$ quasi-norm of a "virtual" weight matrix, where the order $p$ shrinks toward zero with increasing depth. This causes the network to favor functions whose active subspace has minimal rank, directly connecting architectural choices with the structural properties of the learned mapping. Empirical results demonstrate that this layerwise bias enhances generalization and subspace alignment when the data are generated by low-dimensional latent processes.
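The intuition can be made concrete with a standard variational identity for a bare product of $L$ weight matrices under weight decay; the representation-cost result of Parkinson et al. (2023) is the ReLU-network analogue and differs in its exact form:

$$
\min_{W_L \cdots W_1 \,=\, W} \; \sum_{\ell=1}^{L} \|W_\ell\|_F^2 \;=\; L\,\|W\|_{S_{2/L}}^{2/L},
\qquad
\|W\|_{S_p} := \Big(\sum_i \sigma_i(W)^p\Big)^{1/p}.
$$

For $L = 2$ this reduces to twice the nuclear norm; as $L$ grows, the order $2/L$ tends to zero and the penalty increasingly approximates rank, matching the depth-driven low-rank bias described above.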
5. Layerwise Linear Connectivity and Representation Analysis
Empirical and theoretical studies of solution landscapes in deep networks reveal that layerwise parameter interpolation between distinct trained models does not yield loss barriers, even when global interpolation (simultaneous averaging of all layers) does (Adilova et al., 2023). This property, termed Layer-Wise Linear Mode Connectivity (LLMC), holds broadly across architectures (CNNs, transformers, LLMs), training initializations, and data splits. Cumulative barriers may arise only when groups of middle layers are interpolated, identifying these as critical for encoding task-specific information. Extensions such as Layerwise Linear Feature Connectivity (LLFC) further show that, when parameter interpolation between models results in connected minima, the feature maps of each layer along the interpolation path are also linearly interpolated up to scaling (Zhou et al., 2023). These phenomena indicate a layerwise modularity and redundancy within high-dimensional neural representations, with implications for model averaging, ensembling, and federated learning protocol design.
| Connectivity Concept | Over What? | Quantitative Measure | Holds When |
|---|---|---|---|
| LMC (mode connectivity) | Output/loss | Scalar loss vs. α along the linear path | Spawned or permutation-aligned models |
| LLFC (feature connectivity) | All feature layers | Layer activations at α along the path | Spawned or permutation-aligned models |
| LLMC (layerwise parameters) | Per-layer weights | No loss barrier under per-layer interpolation | Generally, across architectures and initializations |
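The following NumPy sketch shows how such layerwise versus global barriers are measured; the two "models" are random parameter sets standing in for independently trained networks, so it illustrates the bookkeeping rather than the empirical LLMC finding itself.

```python
# Measure loss barriers along global vs. per-layer interpolation paths between
# two parameter sets of a tiny two-layer ReLU regressor.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = rng.normal(size=256)

def loss(params):
    """MSE of the two-layer ReLU regressor with the given parameters."""
    H = np.maximum(X @ params["W1"].T, 0.0)
    return float(np.mean((H @ params["w2"] - y) ** 2))

def make_model():
    return {"W1": 0.3 * rng.normal(size=(32, 10)), "w2": 0.3 * rng.normal(size=32)}

A, B = make_model(), make_model()
alphas = np.linspace(0.0, 1.0, 11)
base = lambda a: (1 - a) * loss(A) + a * loss(B)   # linear interpolation of endpoint losses

def barrier(path):
    return max(p - base(a) for p, a in zip(path, alphas))

# Global interpolation: all layers move simultaneously.
global_path = [loss({k: (1 - a) * A[k] + a * B[k] for k in A}) for a in alphas]

# Layerwise interpolation: one layer moves while the others stay at model A
# (one common LLMC protocol; variants interpolate groups of layers instead).
for name in A:
    path = [loss({k: (1 - a) * A[k] + a * B[k] if k == name else A[k] for k in A})
            for a in alphas]
    print(f"layerwise barrier ({name}): {barrier(path):.4f}")
print(f"global barrier: {barrier(global_path):.4f}")
```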
6. Applications and Limitations
Layerwise linear models underpin efficient network compression and pruning, robust aggregation, and theoretical guarantees for generalization and optimization. Structured efficient linear substitutions (e.g., TT, HashedNet, ACDC) operate at the layer level to improve the parameter-count versus accuracy trade-off in convolutional architectures (Gray et al., 2019). Pruning methods for LLMs exploit non-uniform layerwise sparsity, informed by the layerwise distribution of activation and weight outliers, to substantially surpass uniform strategies in both model performance and inference efficiency (Yin et al., 2023).
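A hedged NumPy sketch of non-uniform layerwise magnitude pruning in this spirit: layers with more weight outliers receive lower sparsity under a global budget. The outlier score and the allocation rule are illustrative assumptions, not the exact procedure of the cited work (which also uses activation statistics).

```python
# Allocate per-layer sparsity from a simple weight-outlier score, then apply
# per-layer magnitude pruning. Heavier-tailed layers are pruned less.
import numpy as np

rng = np.random.default_rng(0)
# Toy layers with increasingly light tails (layer0 has the most outliers).
layers = {f"layer{i}": rng.standard_t(df=2 + i, size=(256, 256)) for i in range(4)}

target_sparsity, spread = 0.7, 0.2

# Outlier score per layer: fraction of weights far above the layer's mean magnitude.
scores = {name: float(np.mean(np.abs(W) > 5.0 * np.abs(W).mean()))
          for name, W in layers.items()}

# Allocate per-layer sparsity around the global target: more outliers -> prune less.
vals = np.array(list(scores.values()))
centered = (vals - vals.mean()) / (vals.max() - vals.min() + 1e-12)
sparsities = dict(zip(scores, np.clip(target_sparsity - spread * centered, 0.0, 0.99)))

pruned = {}
for name, W in layers.items():
    k = int(sparsities[name] * W.size)                  # number of weights to zero out
    thresh = np.partition(np.abs(W).ravel(), k)[k]      # per-layer magnitude cutoff
    pruned[name] = np.where(np.abs(W) >= thresh, W, 0.0)
    print(name, f"sparsity={sparsities[name]:.2f}", f"outlier_score={scores[name]:.4f}")
```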
However, linear theory intrinsically limits function approximation to the kernel or span defined at initialization, precluding efficient representation of high-dimensional ridge functions and certain non-polynomial transformations (Misiakiewicz et al., 2023). Recent research addresses these deficiencies via mean-field and maximal-update regimes, which enable genuine feature learning by allowing movement far from initialization, and via architectural and regularization choices that bias networks toward appropriately constrained solution spaces.
7. Future Directions and Theoretical Implications
Layerwise linear models serve as a critical bridge between tractable mathematical analysis and the practical demands of designing, training, and deploying deep networks. They clarify the impact of depth, parameterization, and regularization on representation bias, reveal hitherto unnoticed modularity in learned solutions, and provide rigorous, interpretable algorithms for scalable training and robust aggregation. A continuing challenge is extending these principles beyond the linear or near-linear regime to capture the full expressive and adaptive capacities of nonlinear deep networks, while maintaining analytic and computational tractability. The development of hybrid local-linear/global-nonlinear analyses and the integration of Bayesian, optimization-theoretic, and statistical perspectives are promising avenues for advancing both foundational understanding and practical capabilities in deep learning.