Linearized Wide Two-Layer Neural Policies
- The paper presents a rigorous analysis of linearized wide two-layer neural policies, quantifying expressivity, generalization, and robustness trade-offs in reinforcement learning tasks.
- Methodologies leverage explicit linearization through random features and NTK expansions, enabling kernel-based optimization with predictable convergence in high-dimensional settings.
- Implications for architecture design emphasize that moderate width scaling captures low-frequency patterns effectively while highlighting limitations in fitting high-frequency functions, motivating adaptive nonlinear extensions.
Linearized wide two-layer neural policies are policy architectures and optimization frameworks in which a two-layer (one-hidden-layer) neural network with a very wide hidden layer is either explicitly or effectively studied via linearization—usually around its random initialization. This approach enables precise analysis of expressivity, generalization, representation learning, and robustness, while providing concrete performance–complexity trade-offs and characterizing the limitations of “lazy training” in reinforcement learning and planning contexts. The linearized models encompass both random features (RF) and neural tangent kernel (NTK) regimes, and their performance has been rigorously compared to fully-trained nonlinear networks, especially in high-dimensional domains and complex control tasks.
1. Mathematical Formulation and Linearization Regimes
The canonical two-layer network parameterization for policies (or value functions) is
$$f_N(x; a, W) = \sum_{i=1}^{N} a_i \,\sigma(\langle w_i, x\rangle),$$
where $N$ is the width, $\sigma$ the activation, $a = (a_1, \dots, a_N)$ the output weights, and $W = (w_1, \dots, w_N)$ the input weights. Linearization refers to approximating the learned function by restricting the learning or parameter update as follows:
- Random Features (RF): $W$ is fixed at random initialization and only $a$ is trained, so $f_N$ is a linear combination of the random “basis” functions $\sigma(\langle w_i, x\rangle)$. This is equivalent to linear regression in a random feature space (Ghorbani et al., 2019, Misiakiewicz et al., 2023).
- Neural Tangent Kernel (NTK): first-order Taylor expansion of the network around initialization, allowing both $a$ and $W$ to change infinitesimally. The effective function space is the span of all first-order derivatives at initialization: $f^{\mathrm{NT}}(x; \theta) = f_N(x; \theta_0) + \langle \nabla_\theta f_N(x; \theta_0),\, \theta - \theta_0\rangle$ with $\theta = (a, W)$.
In this view, learning dynamics are governed by a fixed NTK, and the network behaves as a linear (kernel) model (Ghorbani et al., 2019, Misiakiewicz et al., 2023).
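The following is a minimal NumPy sketch, not taken from the cited papers, that makes the two regimes concrete for a scalar-output two-layer ReLU network: the RF model solves a ridge problem in the frozen random features, while the NT model regresses on the full parameter-gradient features at initialization. The sizes, ridge penalty, and toy target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 10, 512, 500                                   # illustrative sizes

# Toy regression data standing in for a supervised policy/value target.
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d) / np.sqrt(d))

# Random initialization of f(x; a, W) = sum_i a_i * relu(<w_i, x>).
W0 = rng.standard_normal((N, d)) / np.sqrt(d)
a0 = rng.standard_normal(N) / np.sqrt(N)

def relu(z):
    return np.maximum(z, 0.0)

# Random Features (RF): freeze W at W0 and solve a ridge problem in a only.
Phi_rf = relu(X @ W0.T)                                  # n x N features sigma(<w_i, x>)
a_rf = np.linalg.solve(Phi_rf.T @ Phi_rf + 1e-3 * np.eye(N), Phi_rf.T @ y)
f_rf = Phi_rf @ a_rf

# NTK linearization: regress on the gradient features of all parameters at init.
#   df/da_i = relu(<w_i, x>),   df/dw_i = a_i * 1{<w_i, x> > 0} * x
gate = (X @ W0.T > 0).astype(float)                      # n x N active-unit indicators
grad_w = (gate * a0)[:, :, None] * X[:, None, :]         # n x N x d gradient blocks
Phi_nt = np.concatenate([Phi_rf, grad_w.reshape(n, N * d)], axis=1)
f0 = Phi_rf @ a0                                         # network output at initialization
delta = np.linalg.lstsq(Phi_nt, y - f0, rcond=None)[0]   # linearized parameter update
f_nt = f0 + Phi_nt @ delta                               # f(theta0) + <grad f, theta - theta0>

print("RF  train MSE:", np.mean((f_rf - y) ** 2))
print("NTK train MSE:", np.mean((f_nt - y) ** 2))
```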
The population risk $R(f) = \mathbb{E}_x\big[(f(x) - f_*(x))^2\big]$, for a target $f_*$, forms the basis of quantitative analysis. Both RF and NT models converge to their infinite-width (kernel) analogues under suitable scaling of the width $N$ with the input dimension $d$.
In reinforcement learning or planning applications, a softmax policy $\pi(a \mid s) \propto \exp(f_a(s))$ is often used, with the logits $f_a(s)$ given as the output of a wide two-layer neural network (Junyent et al., 2019), and training modifies either the last-layer weights only (RF) or uses the NTK linearization.
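A minimal sketch of such a policy head is given below, assuming discrete actions and the RF regime (hidden weights frozen, only the last-layer weights trained); the state dimension, action count, and REINFORCE-style update are hypothetical and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_state, n_actions, N = 32, 4, 1024          # hypothetical sizes

# Frozen random hidden layer (RF regime): only the last-layer weights A are trained.
W = rng.standard_normal((N, d_state)) / np.sqrt(d_state)
A = np.zeros((n_actions, N))                 # trainable last-layer weights

def logits(s):
    """Two-layer network: logit_a(s) = sum_i A[a, i] * relu(<w_i, s>)."""
    return A @ np.maximum(W @ s, 0.0)

def policy(s):
    """Softmax policy pi(a | s) proportional to exp(logit_a(s))."""
    z = logits(s)
    z -= z.max()                              # numerical stability
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(s, a):
    """Gradient of log pi(a | s) w.r.t. A: linear in the frozen random features."""
    phi = np.maximum(W @ s, 0.0)              # random features of the state
    g = -np.outer(policy(s), phi)
    g[a] += phi
    return g

# One REINFORCE-style update on a fake transition (illustrative only).
s, a, advantage, lr = rng.standard_normal(d_state), 2, 1.0, 1e-2
A += lr * advantage * grad_log_pi(s, a)
print(policy(s))
```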
2. Expressivity, Approximation Capacity, and Staircase Phenomena
The RF and NTK linearizations exhibit a strict capacity–parameterization–dimension tradeoff. In (Ghorbani et al., 2019), for high-dimensional inputs ($x$ standard Gaussian in $\mathbb{R}^d$ or uniform on the sphere $\mathbb{S}^{d-1}(\sqrt{d})$) and a fixed activation $\sigma$, two key regimes are identified:
- Approximation-Limited Regime: sample size effectively unlimited, width $N$ polynomial in $d$. The RF model with $N \asymp d^{\ell}$ can fit all degree-$\ell$ polynomials in $x$, but not higher degree. NTK linearization with the same width can fit degree-$(\ell+1)$ polynomials, matching the additional derivative direction gained in the expansion, but still missing higher-order components.
- Sample-Limited (Kernel) Regime: width $N = \infty$, sample size $n$ polynomial in $d$. Kernel ridge regression with the NTK (or RF kernel) can fit at most degree-$\ell$ polynomials if $n \lesssim d^{\ell+1}$.
This "staircase" effect strictly limits the kinds of behaviors linearized wide two-layer neural policies can accomplish for a given problem complexity and parameter/sample budget. It formalizes observed "gaps" between the capacity of lazy linearized policies and that of fully-trained, nonlinear policies—even for targets that are trivial to fit with a single nonlinear neuron but not within the linearized regime (Ghorbani et al., 2019, Misiakiewicz et al., 2023, Ghorbani et al., 2019).
The choice of $N$ (width) is thus key: underprovisioning relative to the input dimension $d$ leads to limited expressivity. In RL and planning settings, this relates directly to which distributions or value/policy function classes the policy can approximate and hence which decision boundaries or exploration strategies can be realized.
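This degree barrier can be probed numerically. The simulation below is an illustrative sketch (not a reproduction of the cited experiments): it fits an essentially pure degree-3 target on the sphere with random-feature ridge regression at several widths. Under the staircase picture, widths of order $d$ or $d^2$ capture little of the degree-3 component, while widths approaching $d^3$ begin to fit it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_train, n_test = 10, 5000, 2000

def sphere(n):
    """Sample inputs uniformly on the sphere of radius sqrt(d)."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True) * np.sqrt(d)

# Degree-3 target along a random unit direction (approximately a pure degree-3 harmonic).
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

def target(x):
    z = x @ u
    return z ** 3 - 3 * z

Xtr, Xte = sphere(n_train), sphere(n_test)
ytr, yte = target(Xtr), target(Xte)

def rf_relative_test_error(N, lam=1e-6):
    """Ridge regression on N random ReLU features (with random biases)."""
    W = rng.standard_normal((N, d)) / np.sqrt(d)
    b = rng.standard_normal(N)
    Ptr = np.maximum(Xtr @ W.T + b, 0.0)
    Pte = np.maximum(Xte @ W.T + b, 0.0)
    a = np.linalg.solve(Ptr.T @ Ptr + lam * n_train * np.eye(N), Ptr.T @ ytr)
    return np.mean((Pte @ a - yte) ** 2) / np.mean(yte ** 2)

for N in (d, d ** 2, 2 * d ** 3):        # widths of order d, d^2, d^3
    print(f"width N = {N:5d}   relative test error = {rf_relative_test_error(N):.3f}")
```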
3. Generalization, Bias, and Frequency Principle
Wide, linearized two-layer networks display an implicit bias: they favor smoother (low-frequency) function classes over highly oscillatory ones. The "Frequency Principle" (F-Principle), formalized in (Zhang et al., 2019), asserts that these networks fit low frequencies first and penalize high frequencies. The optimization is equivalent to minimizing an explicit quadratic FP-norm over Fourier components, of the form
$$\|h\|_{\mathrm{FP}}^2 = \int \frac{|\hat{h}(\xi)|^2}{\gamma(\xi)}\, d\xi,$$
subject to fitting the training data, where $\hat{h}$ is the Fourier transform of the learned correction and the weight $\gamma(\xi)$ decays as the frequency $|\xi|$ grows, so high-frequency components incur a larger penalty.
The bias grows for target policies rich in high-frequency content, leading to higher generalization error. The explicit FP-norm gives a means for quantifying a priori error bounds and for understanding why, in practical RL or planning tasks with smooth solution structure (e.g., navigation, manipulation), linearized wide two-layer policies perform robustly, but may falter in tasks with subtle, high-frequency actions or rewards.
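The frequency ordering is easy to observe on a one-dimensional toy problem. The sketch below (illustrative target, width, and step counts) runs gradient descent on the output weights of a wide two-layer ReLU network and tracks the residual amplitude at a low and a high Fourier mode; consistent with the F-Principle, the low-frequency mode is fit much earlier.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, lr = 128, 2048, 0.2

# 1-D regression target mixing a low- and a high-frequency component.
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
y = np.sin(x) + 0.5 * np.sin(8 * x)

# Wide two-layer ReLU network in the lazy regime: hidden weights/biases frozen,
# only the output weights a are trained by full-batch gradient descent on the MSE.
w = rng.standard_normal(N)
b = rng.uniform(-np.pi, np.pi, N)
Phi = np.maximum(np.outer(x, w) + b, 0.0) / np.sqrt(N)
a = np.zeros(N)

def mode_amp(res, k):
    """Amplitude of the residual's k-th Fourier mode (x spans exactly one period)."""
    return 2.0 * np.abs(np.fft.rfft(res)[k]) / n

for step in range(1, 5001):
    res = Phi @ a - y
    a -= lr * (Phi.T @ res) / n               # gradient step on the mean squared error
    if step in (100, 1000, 5000):
        # sin(x) lives at mode 1, sin(8x) at mode 8.
        print(f"step {step:5d}   residual at mode 1: {mode_amp(res, 1):.4f}   "
              f"at mode 8: {mode_amp(res, 8):.4f}")
```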
4. Optimization, Training Dynamics, and Robustness
Training in the linearized regime is convex: for NTK/RF models, optimal solutions can generally be obtained via least-squares, kernel ridge regression, or cross-entropy minimization over policy outputs (Junyent et al., 2019). This makes convergence predictable and solutions efficiently computable. Moreover, in the infinite-width limit, stochastic gradient descent (SGD) dynamics for the parameter empirical measure converge (law of large numbers/central limit theorem) to deterministic/stochastic mean-field PDEs (Descours et al., 2022).
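As a concrete instance of the convex/kernel route, the sketch below fits a toy value function by kernel ridge regression with the closed-form infinite-width NTK of a two-layer ReLU network; the arc-cosine kernel expressions are standard, while the data, targets, and regularization level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def ntk_two_layer_relu(X1, X2):
    """Infinite-width NTK of a two-layer ReLU network (both layers trained).

    Standard arc-cosine kernel expressions:
      k0(x, x') = (pi - theta) / (2*pi)
      k1(x, x') = ||x|| ||x'|| (sin(theta) + (pi - theta) cos(theta)) / (2*pi)
      NTK(x, x') = k1(x, x') + <x, x'> * k0(x, x'),
    where theta is the angle between x and x'.
    """
    n1 = np.linalg.norm(X1, axis=1, keepdims=True)
    n2 = np.linalg.norm(X2, axis=1, keepdims=True)
    dot = X1 @ X2.T
    cos = np.clip(dot / (n1 * n2.T), -1.0, 1.0)
    theta = np.arccos(cos)
    k0 = (np.pi - theta) / (2 * np.pi)
    k1 = (n1 * n2.T) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    return k1 + dot * k0

# Toy value-regression data: states and noisy "returns" (illustrative only).
d, n = 8, 400
S = rng.standard_normal((n, d))
v = np.cos(S[:, 0]) + 0.5 * S[:, 1] + 0.1 * rng.standard_normal(n)

# Kernel ridge regression: alpha = (K + lam*n*I)^{-1} v, prediction = K(s, S) @ alpha.
lam = 1e-3
K = ntk_two_layer_relu(S, S)
alpha = np.linalg.solve(K + lam * n * np.eye(n), v)

S_test = rng.standard_normal((50, d))
v_hat = ntk_two_layer_relu(S_test, S) @ alpha
print("predicted values for 5 test states:", np.round(v_hat[:5], 3))
```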
Robustness considerations, however, reveal trade-offs (Dohmatob et al., 2022): in the NTK and RF regimes, test error and robustness (measured as the Dirichlet energy of the policy function) exhibit a zero-sum relationship. As policy approximation improves through linearization, robustness to adversarial perturbations generally degrades, particularly if the initialization scale or structure is suboptimal. Networks trained in the pure NTK regime or with large initialization terms inherit substantial nonrobustness, whereas full nonlinear training can sometimes maintain a more favorable trade-off.
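The Dirichlet-energy notion of robustness can be estimated directly. Below is a minimal sketch, assuming a frozen-hidden-layer ReLU model as in Section 1 and a standard-normal input distribution (both assumptions are for illustration): the energy is the expected squared input-gradient norm, which for this model has a closed-form per-sample gradient.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, n_mc = 16, 1024, 2000

# Frozen hidden layer and (here random; in practice trained) output weights.
W = rng.standard_normal((N, d)) / np.sqrt(d)
a = rng.standard_normal(N) / np.sqrt(N)

def dirichlet_energy(W, a, n_mc):
    """Monte Carlo estimate of E_x ||grad_x f(x)||^2 for f(x) = sum_i a_i relu(<w_i, x>).

    For ReLU, grad_x f(x) = sum_i a_i * 1{<w_i, x> > 0} * w_i, so the energy quantifies
    how sharply the policy score changes under small input perturbations.
    """
    X = rng.standard_normal((n_mc, d))                 # x ~ N(0, I_d), illustrative choice
    active = (X @ W.T > 0).astype(float)               # n_mc x N gate indicators
    grads = (active * a) @ W                           # n_mc x d input gradients
    return np.mean(np.sum(grads ** 2, axis=1))

print("estimated Dirichlet energy:", dirichlet_energy(W, a, n_mc))
```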
5. Practical Applications and Empirical Findings
Empirical studies validate the performance of linearized wide two-layer neural policies in a variety of domains:
- Planning from Pixels: In π-IW (Junyent et al., 2019), a wide two-layer network mapping screen inputs to action probabilities learns a feature representation whose last hidden layer activations are binarized and used for width-based novelty pruning in a search tree (a minimal version of this binarize-and-prune step is sketched after this list). This removes the need for hand-designed state features. Notably, π-IW outperforms rollout IW and AlphaZero in certain grid and Atari tasks despite using a shallow architecture, illustrating that linearized policies over rich learned features suffice for challenging domains.
- High-Dimensional Policy Learning: In RL perceptrons (Patel et al., 2023), a two-layer network’s macroscopic policy learning can be reduced to ODEs describing norm and alignment evolution of weights with the teacher. This enables closed-form derivation of optimal learning rate and curriculum annealing schedules, which were empirically shown to explain trade-offs in speed and final accuracy in high-dimensional settings, matching behaviors observed in trained deep policies.
- Geometric Structure and Low-Effective-Dimension: In continuous control, training a linearized (wide) two-layer neural policy via policy gradient/actor-critic methods induces a low-dimensional manifold of attainable states, with dimension upper-bounded in terms of the action-space dimension (Tiwari et al., 28 Jul 2025). Empirical results in MuJoCo environments confirm learned states concentrate on manifolds of dimension far less than the ambient space, validating theoretical predictions and motivating architectural augmentations (e.g., sparse manifold layers) that improve practical performance.
- Sample Complexity and Expressivity: Randomization techniques that push network dynamics beyond first-order linearization (“escaping the NTK”) allow the network to align with quadratic or higher-order Taylor terms (Bai et al., 2019). The resulting model shows improved sample complexity (by up to a factor of the input dimension $d$ for quadratic approximations, with correspondingly larger gains for $k$-th order approximations) under isotropic input distributions, and enjoys benign optimization landscapes.
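Returning to the planning-from-pixels item above, the sketch below is a minimal, hypothetical version of the binarize-and-prune idea rather than the π-IW implementation: last-hidden-layer activations are thresholded into binary features, and a width-1 novelty test accepts a node only if it switches on some feature never seen before.

```python
import numpy as np

rng = np.random.default_rng(6)
d_obs, N = 64, 256                      # hypothetical observation and hidden sizes

W = rng.standard_normal((N, d_obs)) / np.sqrt(d_obs)   # hidden layer of the policy net

def binary_features(obs):
    """Binarize last-hidden-layer activations: feature i is 'on' if unit i fires."""
    return np.maximum(W @ obs, 0.0) > 0.0

class NoveltyTable:
    """Width-1 novelty test over binary features, used to prune search-tree nodes."""
    def __init__(self, n_features):
        self.seen = np.zeros(n_features, dtype=bool)

    def is_novel(self, phi):
        new = phi & ~self.seen           # features switched on for the first time
        self.seen |= phi                 # remember everything observed so far
        return bool(new.any())           # prune the node if nothing new appears

table = NoveltyTable(N)
for step in range(5):
    obs = rng.standard_normal(d_obs)     # stand-in for a screen observation
    print(f"node {step}: novel = {table.is_novel(binary_features(obs))}")
```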
6. Limitations, Extensions, and Theoretical Insights
Despite their tractability, linearized wide two-layer neural policies have important limitations:
- Expressive Power: Linearized models cannot learn target functions requiring genuinely nonlinear feature learning unless sufficiently overparameterized; expressivity is bottlenecked by “polynomial degree” as a function of the width $N$ and sample size $n$ relative to the input dimension $d$ (Ghorbani et al., 2019, Misiakiewicz et al., 2023, Ghorbani et al., 2019). Fully-trained networks escape these constraints, sometimes by orders of magnitude in sample efficiency.
- Robustness: As established in (Dohmatob et al., 2022), robustness to adversarial or structured noise is generally inferior in the lazy/linearized regime because the policy preserves or even amplifies undesirable properties from initialization.
- Low-Dimensional Structure: Symmetries in input distribution or target function can collapse dynamics to lower-dimensional PDEs; for odd targets, linear predictors suffice, and in subspace-structured tasks, weight evolution is confined to the low-dimensional structure (Hajjar et al., 2022).
Extensions to adaptive scaling (“mean-field” regime), higher-order Taylor approximations, or incorporation of sparse/local manifold layers ameliorate some expressivity and performance issues, combining linearized analysis for tractability with nonlinear advances for practical efficacy (Bai et al., 2019, Tiwari et al., 28 Jul 2025).
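To illustrate what a higher-order expansion adds beyond the NTK term, the following sketch (a generic second-order Taylor model in the hidden weights, not the specific construction of Bai et al., 2019) compares zeroth-, first-, and second-order expansions of a smooth-activation two-layer network against the exact network at a displaced parameter; the quadratic term captures curvature that the first-order (NTK) term alone misses.

```python
import numpy as np

rng = np.random.default_rng(7)
d, N = 6, 512

a = rng.standard_normal(N) / np.sqrt(N)
W0 = rng.standard_normal((N, d)) / np.sqrt(d)          # initialization
dW = 0.1 * rng.standard_normal((N, d))                 # a small parameter displacement

def f(x, W):
    """Exact two-layer network f(x; W) = sum_i a_i tanh(<w_i, x>)."""
    return a @ np.tanh(W @ x)

def taylor(x, order):
    """Taylor expansion of f(x; W0 + dW) in the hidden weights around W0."""
    z = W0 @ x                                         # pre-activations at initialization
    dz = dW @ x                                        # first-order change in pre-activations
    s = np.tanh(z)
    out = a @ s                                        # zeroth order: f at initialization
    if order >= 1:
        out += a @ ((1 - s ** 2) * dz)                 # tanh'(z) = 1 - tanh(z)^2
    if order >= 2:
        out += 0.5 * a @ ((-2 * s * (1 - s ** 2)) * dz ** 2)   # tanh''(z) term
    return out

x = rng.standard_normal(d)
exact = f(x, W0 + dW)
for k in (0, 1, 2):
    print(f"order {k}: abs error vs exact network = {abs(taylor(x, k) - exact):.2e}")
```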
7. Implications for Architecture and Algorithm Design
The collected findings provide guidance for constructing effective linearized wide two-layer neural policies:
- For tasks with smooth value functions, moderate polynomial scaling of the width $N$ in the input dimension $d$ suffices for both RF and NTK models; for high-frequency or highly nonlinear tasks, either larger widths or escaping the linearized regime is essential (Ghorbani et al., 2019, Misiakiewicz et al., 2023).
- Binary feature extraction from hidden layers (as in π-IW) enables width-based planners to operate without manual feature design even in high-dimensional pixel spaces (Junyent et al., 2019).
- Explicit control and tuning of initialization (e.g., as per (Dohmatob et al., 2022)), regularization, and architectural modifications (such as sparse or low-dimensional layers (Tiwari et al., 28 Jul 2025)) can improve both robustness and sample efficiency.
- Kernel perspectives supply closed-form risk and generalization error estimates for design-time evaluation.
- When strict feasibility or optimality is required (e.g., in parametric optimization (Bae et al., 2023)), piecewise linear policy approximators built from triangulated training data, then approximated by neural networks, admit formal guarantees on approximate feasibility and suboptimality.
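For the last item, below is a minimal sketch of the triangulation step using SciPy's Delaunay-based linear interpolator; the parametric problem and sample sizes are hypothetical stand-ins, and the subsequent neural-network fit discussed in (Bae et al., 2023) is omitted. The piecewise linear interpolant over a triangulation of sampled problem parameters is the object for which approximate feasibility and suboptimality guarantees can then be stated.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(8)

# Hypothetical parametric problem: for a parameter p in [0, 1]^2, the "optimal action"
# is some smooth function of p (a stand-in here, since the true solver is omitted).
def solve_instance(p):
    return np.array([np.sin(np.pi * p[0]) * p[1], p[0] - 0.5 * p[1]])

# Sample training parameters and their solutions.
P = rng.uniform(0.0, 1.0, size=(200, 2))
U = np.array([solve_instance(p) for p in P])

# Piecewise linear policy: barycentric interpolation over a Delaunay triangulation of P.
pl_policy = LinearNDInterpolator(P, U)

# Query inside the convex hull of the samples (outside the hull the interpolant is NaN).
p_query = np.array([[0.3, 0.7]])
print("piecewise linear policy at p =", p_query[0], "->", pl_policy(p_query)[0])
```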
A plausible implication is that, while linearized wide two-layer policies are favorable for tractability and sample-efficient learning in suitably structured tasks, further improvement—particularly for robustness or tasks that require higher-order feature learning—demands architectural or optimization approaches that move beyond this regime.
Key Cited Works:
- Expressivity and polynomial approximation limits: (Ghorbani et al., 2019, Misiakiewicz et al., 2023, Ghorbani et al., 2019)
- Frequency principle and generalization: (Zhang et al., 2019)
- Robustness analysis and adversarial sensitivity: (Dohmatob et al., 2022)
- Empirical and algorithmic validation in RL/planning: (Junyent et al., 2019, Patel et al., 2023, Tiwari et al., 28 Jul 2025)
- Escaping NTK and higher-order expressivity: (Bai et al., 2019)
- Symmetry and low-dimensional reduction: (Hajjar et al., 2022)
- Parametric optimization and policy approximation: (Bae et al., 2023)