
Linearized Wide Two-Layer Neural Policies

Updated 30 July 2025
  • The paper presents a rigorous analysis of linearized wide two-layer neural policies, quantifying expressivity, generalization, and robustness trade-offs in reinforcement learning tasks.
  • Methodologies leverage explicit linearization through random features and NTK expansions, enabling kernel-based optimization with predictable convergence in high-dimensional settings.
  • Implications for architecture design emphasize that moderate width scaling captures low-frequency patterns effectively while highlighting limitations in fitting high-frequency functions, motivating adaptive nonlinear extensions.

Linearized wide two-layer neural policies are policy architectures and optimization frameworks in which a two-layer (one-hidden-layer) neural network with a very wide hidden layer is either explicitly or effectively studied via linearization—usually around its random initialization. This approach enables precise analysis of expressivity, generalization, representation learning, and robustness, while providing concrete performance–complexity trade-offs and characterizing the limitations of “lazy training” in reinforcement learning and planning contexts. The linearized models encompass both random features (RF) and neural tangent kernel (NTK) regimes, and their performance has been rigorously compared to fully-trained nonlinear networks, especially in high-dimensional domains and complex control tasks.

1. Mathematical Formulation and Linearization Regimes

The canonical two-layer network parameterization for policies (or value functions) is

$$f(x) = \sum_{i=1}^N a_i\, \sigma(w_i^\top x),$$

where $N$ is the width, $\sigma$ the activation, $a_i$ the output weights, and $w_i$ the input weights. Linearization refers to approximating the learned function by restricting the learning or parameter updates as follows:

  • Random Features (RF): $w_i$ fixed at random initialization, only $a_i$ trained: $f(x)$ is a linear combination of random “basis” functions $\sigma(w_i^\top x)$. This is equivalent to linear regression in a random feature space (Ghorbani et al., 2019, Misiakiewicz et al., 2023).
  • Neural Tangent Kernel (NTK): First-order Taylor expansion of the network around initialization, allowing $w_i$ and $a_i$ to change infinitesimally. The effective function space is the span of all first-order derivatives at initialization:

$$f(x) \approx f_0(x) + \sum_{i=1}^N (a_i - a_{0,i})\, \sigma(w_{0,i}^\top x) + \sum_{i=1}^N a_{0,i}\, \sigma'(w_{0,i}^\top x)\, (w_i - w_{0,i})^\top x$$

In this view, learning dynamics are governed by a fixed NTK, and the network behaves as a linear (kernel) model (Ghorbani et al., 2019, Misiakiewicz et al., 2023).

The population risk $R(f) = \mathbb{E}_x[(f_*(x) - f(x))^2]$ forms the basis of quantitative analysis. Both RF and NT models converge to their infinite-width (kernel) analogues under suitable scaling of $N$ with input dimension $d$.
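
To make the two regimes concrete, the following is a minimal NumPy sketch (assuming a ReLU activation and a synthetic regression target, both illustrative choices not taken from the cited papers) that builds the RF feature map $\sigma(w_i^\top x)$ and the NTK first-order Taylor feature map, then fits each by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 10, 512, 2000                       # input dim, width, sample size

W0 = rng.normal(size=(N, d)) / np.sqrt(d)     # frozen first-layer weights
a0 = rng.normal(size=N) / np.sqrt(N)          # output weights at initialization

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def rf_features(X):
    """Random-features map: only the output weights a_i are trainable."""
    return relu(X @ W0.T)                     # shape (n, N)

def ntk_features(X):
    """First-order Taylor (NTK) features: gradients w.r.t. (a_i, w_i) at init."""
    pre = X @ W0.T                            # pre-activations, shape (n, N)
    feat_a = relu(pre)                        # df/da_i
    feat_w = (a0 * drelu(pre))[:, :, None] * X[:, None, :]   # df/dw_i
    return np.concatenate([feat_a, feat_w.reshape(len(X), -1)], axis=1)

# Illustrative smooth target; "training" the linearized model is least squares.
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sin(X @ rng.normal(size=d))

for name, features in [("RF", rf_features), ("NTK", ntk_features)]:
    F = features(X)
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    print(name, "train MSE:", np.mean((F @ coef - y) ** 2))
```

In both cases the trainable parameters enter linearly, so fitting reduces to a convex least-squares problem, which is what makes the kernel-level analysis of the population risk tractable.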

In reinforcement learning or planning applications, the policy $\pi_\theta(a|s) \propto \exp(h_a(s,\theta)/\tau)$ is often used, with the logits $h_a(s,\theta)$ given as the output of a wide two-layer neural network (Junyent et al., 2019), and training modifies either the last-layer weights only (RF) or uses the NTK linearization.
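
A minimal sketch of this softmax policy head is shown below; the helper name, width, and temperature value are illustrative assumptions:

```python
import numpy as np

def softmax_policy(hidden, A, tau=1.0):
    """pi(a|s) proportional to exp(h_a(s)/tau), with logits h = A @ hidden
    computed from the (wide) hidden layer of the two-layer network."""
    logits = A @ hidden / tau
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Illustrative use: 4 actions, width-256 hidden layer, random activations.
rng = np.random.default_rng(1)
hidden = np.maximum(rng.normal(size=256), 0.0)
A = rng.normal(size=(4, 256)) / np.sqrt(256)
print(softmax_policy(hidden, A, tau=0.5))
```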

2. Expressivity, Approximation Capacity, and Staircase Phenomena

The RF and NTK linearizations exhibit a strict capacity–parameterization–dimension tradeoff. In (Ghorbani et al., 2019), for high-dimensional inputs ($x \in \mathbb{R}^d$ or on the sphere $S^{d-1}$) and fixed activation $\sigma$, two key regimes are identified:

  • Approximation-Limited: $n = \infty$, $N$ polynomial in $d$. The RF model with $N \in [d^{\ell+\delta}, d^{\ell+1-\delta}]$ can fit all degree-$\ell$ polynomials in $x$, but not higher degree. NTK linearization can fit degree-$(\ell+1)$ polynomials, matching the added derivative in the expansion, but still missing higher-order components.
  • Sample-Limited (Kernel) Regime: $N \to \infty$, $n$ polynomial in $d$. Kernel ridge regression with the NTK (or RF kernel) can fit at most degree-$\ell$ polynomials if $n \in [d^{\ell+\delta}, d^{\ell+1-\delta}]$.

This "staircase" effect strictly limits the kinds of behaviors linearized wide two-layer neural policies can accomplish for a given problem complexity and parameter/sample budget. It formalizes observed "gaps" between the capacity of lazy linearized policies and that of fully-trained, nonlinear policies—even for targets that are trivial to fit with a single nonlinear neuron but not within the linearized regime (Ghorbani et al., 2019, Misiakiewicz et al., 2023, Ghorbani et al., 2019).

The choice of width $N$ is thus key: underprovisioning relative to $d$ leads to limited expressivity. In RL and planning settings, this relates directly to which distributions or value/policy function classes the policy can approximate and hence which decision boundaries or exploration strategies can be realized.
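
The sketch below gives a small numerical illustration (not a reproduction of the cited bounds) of this width dependence: a ReLU random-features model is fit to a degree-2 polynomial target with width scaling as $d$ versus $d^2$; the sample sizes, ridge parameter, and normalization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, ridge = 20, 4000, 2000, 1e-6

# Degree-2 target on inputs normalized to radius sqrt(d).
v = rng.normal(size=d)
v /= np.linalg.norm(v)
target = lambda X: (X @ v) ** 2

def sample(n):
    X = rng.normal(size=(n, d))
    return X * np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)

def rf_test_mse(N):
    """Test MSE of a ReLU random-features model with N frozen neurons."""
    W = rng.normal(size=(N, d)) / np.sqrt(d)
    Xtr, Xte = sample(n_train), sample(n_test)
    Ftr, Fte = np.maximum(Xtr @ W.T, 0), np.maximum(Xte @ W.T, 0)
    a = np.linalg.solve(Ftr.T @ Ftr + ridge * np.eye(N), Ftr.T @ target(Xtr))
    return np.mean((Fte @ a - target(Xte)) ** 2)

# Width of order d (below the degree-2 threshold) versus order d^2 (above it).
for N in (d, d * d):
    print(f"N = {N:4d}   test MSE = {rf_test_mse(N):.4f}")
```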

3. Generalization, Bias, and Frequency Principle

Wide, linearized two-layer networks display an implicit bias: they favor smoother (low-frequency) function classes over highly oscillatory ones. The "Frequency Principle" (F-Principle), formalized in (Zhang et al., 2019), asserts that these networks fit low frequencies first and penalize high frequencies. The optimization is equivalent to minimizing an explicit quadratic FP-norm over Fourier components:

$$\|h - h_{\text{ini}}\|^2_{FP} = \int \left[ \frac{1}{N} \sum_i \left(|r_i(0)|^2 + w_i(0)^2\right) |\xi|^{-(d+3)} + \frac{4\pi^2}{N} \sum_i r_i(0)^2 w_i(0)^2\, |\xi|^{-(d+1)} \right]^{-1} \left| \hat{h}(\xi) - \hat{h}_{\text{ini}}(\xi) \right|^2 d\xi$$

The bias grows for target policies rich in high-frequency content, leading to higher generalization error. The explicit FP-norm gives a means for quantifying a priori error bounds and for understanding why, in practical RL or planning tasks with smooth solution structure (e.g., navigation, manipulation), linearized wide two-layer policies perform robustly, but may falter in tasks with subtle, high-frequency actions or rewards.
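
The following is a minimal one-dimensional illustration of the F-Principle in the linearized (random-features) regime, assuming a ReLU model and a target composed of one low- and one high-frequency sinusoid; the frequencies, width, and step counts are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 256, 2048

# 1-D target with one low- and one high-frequency component.
x = np.linspace(-1, 1, n)
y = np.sin(2 * np.pi * x) + 0.5 * np.sin(2 * np.pi * 10 * x)

# Wide two-layer ReLU model, linearized by training only the output weights
# (random-features regime), with 1/sqrt(N) output scaling.
W = rng.normal(size=(N, 2))                               # slope and bias per neuron
F = np.maximum(np.outer(x, W[:, 0]) + W[:, 1], 0) / np.sqrt(N)   # (n, N) features
a = np.zeros(N)
lr = n / np.linalg.norm(F, ord=2) ** 2                    # safe gradient-descent step

def freq_amp(res, k):
    """Amplitude of the residual at frequency k (sine projection)."""
    basis = np.sin(2 * np.pi * k * x)
    return abs(res @ basis) / (basis @ basis)

for step in range(1, 10001):
    res = F @ a - y
    a -= lr * (F.T @ res) / n                             # GD on mean-squared error
    if step in (100, 1000, 10000):
        print(f"step {step:6d}  residual at k=1: {freq_amp(res, 1):.3f}  "
              f"at k=10: {freq_amp(res, 10):.3f}")
```

The residual at the low frequency shrinks much earlier than at the high frequency, consistent with the implicit low-frequency bias quantified by the FP-norm.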

4. Optimization, Training Dynamics, and Robustness

Training in the linearized regime is convex: for NTK/RF models, optimal solutions can generally be obtained via least-squares, kernel ridge regression, or cross-entropy minimization over policy outputs (Junyent et al., 2019). This makes convergence predictable and efficiently computable. Moreover, in the infinite-width limit, stochastic gradient descent (SGD) dynamics for the parameter empirical measure converge (law of large numbers/central limit theorem) to deterministic/stochastic mean-field PDEs (Descours et al., 2022).
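
As a sketch of this convexity, the snippet below solves a random-features model in closed form via kernel ridge regression; the kernel here is the empirical RF kernel $K = \Phi\Phi^\top$, and the target function and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n, lam = 8, 1024, 1500, 1e-3

W = rng.normal(size=(N, d)) / np.sqrt(d)
phi = lambda X: np.maximum(X @ W.T, 0) / np.sqrt(N)    # RF feature map

beta = rng.normal(size=d)
Xtr, Xte = rng.normal(size=(n, d)), rng.normal(size=(500, d))
ytr, yte = np.tanh(Xtr @ beta), np.tanh(Xte @ beta)    # illustrative target

# Convex training: closed-form kernel ridge regression with the RF kernel.
Ktr = phi(Xtr) @ phi(Xtr).T                            # (n, n) kernel matrix
alpha = np.linalg.solve(Ktr + lam * np.eye(n), ytr)    # dual coefficients
pred = phi(Xte) @ phi(Xtr).T @ alpha                   # test predictions
print("test MSE:", np.mean((pred - yte) ** 2))
```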

Robustness considerations, however, reveal trade-offs (Dohmatob et al., 2022): in the NTK and RF regimes, test error and robustness (quantified through the Dirichlet energy of the policy function, i.e., its expected squared input gradient) exhibit a zero-sum relationship. As policy approximation improves through linearization, robustness to adversarial perturbations generally degrades, particularly if the initialization scale or structure is suboptimal. Networks trained in the pure NTK regime or with large initialization terms inherit substantial nonrobustness, whereas full nonlinear training can sometimes maintain a more favorable trade-off.
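
For a random-features policy $f(x) = \sum_i a_i\,\sigma(w_i^\top x)$ with ReLU activation, the input gradient is $\nabla f(x) = \sum_i a_i\,\mathbf{1}[w_i^\top x > 0]\,w_i$, so the Dirichlet energy can be estimated directly by Monte Carlo; the sketch below, with illustrative weights and input distribution, shows one way to compute it:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 10, 1024, 5000

W = rng.normal(size=(N, d)) / np.sqrt(d)       # frozen first-layer weights
a = rng.normal(size=N) / np.sqrt(N)            # (trained) output weights

def dirichlet_energy(a, W, X):
    """Monte-Carlo estimate of E_x ||grad f(x)||^2 for f(x) = sum_i a_i relu(w_i.x)."""
    act = (X @ W.T > 0).astype(float)          # relu'(w_i . x), shape (n, N)
    grads = (act * a) @ W                      # grad f(x) = sum_i a_i relu'(.) w_i
    return np.mean(np.sum(grads ** 2, axis=1))

X = rng.normal(size=(n, d))                    # samples from the input distribution
print("estimated Dirichlet energy:", dirichlet_energy(a, W, X))
```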

5. Practical Applications and Empirical Findings

Empirical studies validate the performance of linearized wide two-layer neural policies in a variety of domains:

  • Planning from Pixels: In π-IW (Junyent et al., 2019), a wide two-layer network mapping screen inputs to action probabilities learns a feature representation whose last-hidden-layer activations are binarized and used for width-based novelty pruning in a search tree. This removes the need for hand-designed state features. Notably, π-IW outperforms rollout IW and AlphaZero in certain grid and Atari tasks despite using a shallow architecture, illustrating that linearized policies over rich learned features suffice for challenging domains (a sketch of the binarized-feature novelty test appears after this list).
  • High-Dimensional Policy Learning: In RL perceptrons (Patel et al., 2023), a two-layer network’s macroscopic policy learning can be reduced to ODEs describing norm and alignment evolution of weights with the teacher. This enables closed-form derivation of optimal learning rate and curriculum annealing schedules, which were empirically shown to explain trade-offs in speed and final accuracy in high-dimensional settings, matching behaviors observed in trained deep policies.
  • Geometric Structure and Low-Effective-Dimension: In continuous control, training a linearized (wide) two-layer neural policy via policy gradient/actor-critic methods induces a low-dimensional manifold of attainable states—its dimension upper-bounded by $2d_a+1$ for $d_a$ actions (Tiwari et al., 28 Jul 2025). Empirical results in MuJoCo environments confirm learned states concentrate on manifolds of dimension far less than the ambient space, validating theoretical predictions and motivating architectural augmentations (e.g., sparse manifold layers) that improve practical performance.
  • Sample Complexity and Expressivity: Randomization techniques that push network dynamics beyond first-order linearization (“escaping the NTK”) allow the network to align with quadratic or higher-order Taylor terms (Bai et al., 2019). The resulting model shows improved sample complexity (by up to a factor of the dimension $d$ for quadratic, or $d^{k-1}$ for $k$-th order approximations) under isotropic input distributions, and enjoys benign optimization landscapes.
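
Below is a minimal sketch of the binarized-feature novelty test mentioned in the π-IW bullet above; the thresholding rule, class names, and the width-1 novelty criterion shown here are simplifying assumptions for illustration rather than the exact mechanism of Junyent et al. (2019):

```python
import numpy as np

def binarize_hidden(hidden, threshold=0.0):
    """Binarize last-hidden-layer activations into boolean state features."""
    return tuple(bool(v) for v in (hidden > threshold))

class NoveltyTable:
    """Width-1 style novelty test: a state counts as novel if it switches on
    at least one feature never seen before during the search."""
    def __init__(self):
        self.seen = set()

    def is_novel(self, features):
        new = {i for i, f in enumerate(features) if f and i not in self.seen}
        self.seen.update(new)
        return len(new) > 0

# Illustrative use: prune tree nodes whose binarized features add nothing new.
rng = np.random.default_rng(0)
table = NoveltyTable()
for step in range(5):
    hidden = np.maximum(rng.normal(size=32), 0.0)   # stand-in for net activations
    feats = binarize_hidden(hidden)
    print(step, "novel" if table.is_novel(feats) else "pruned")
```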

6. Limitations, Extensions, and Theoretical Insights

Despite their tractability, linearized wide two-layer neural policies have important limitations:

  • Expressive Power: Linearized models cannot learn target functions requiring genuinely nonlinear feature learning unless sufficiently overparameterized; expressivity is bottlenecked by “polynomial degree” as a function of $N$ and $d$ (Ghorbani et al., 2019, Misiakiewicz et al., 2023). Fully-trained networks escape these constraints, sometimes by orders of magnitude in sample efficiency.
  • Robustness: As established in (Dohmatob et al., 2022), robustness to adversarial or structured noise is generally inferior in the lazy/linearized regime because the policy preserves or even amplifies undesirable properties from initialization.
  • Low-Dimensional Structure: Symmetries in input distribution or target function can collapse dynamics to lower-dimensional PDEs; for odd targets, linear predictors suffice, and in subspace-structured tasks, weight evolution is confined to the low-dimensional structure (Hajjar et al., 2022).

Extensions to adaptive scaling (“mean-field” regime), higher-order Taylor approximations, or incorporation of sparse/local manifold layers ameliorate some expressivity and performance issues, combining linearized analysis for tractability with nonlinear advances for practical efficacy (Bai et al., 2019, Tiwari et al., 28 Jul 2025).

7. Implications for Architecture and Algorithm Design

The collected findings provide guidance for constructing effective linearized wide two-layer neural policies:

  • For tasks with smooth value functions, moderate polynomial scaling of the width in $d$ suffices for both RF and NTK models; for high-frequency or highly nonlinear tasks, either larger widths or escaping the linearized regime is essential (Ghorbani et al., 2019, Misiakiewicz et al., 2023).
  • Binary feature extraction from hidden layers (as in π-IW) enables width-based planners to operate without manual feature design even in high-dimensional pixel spaces (Junyent et al., 2019).
  • Explicit control and tuning of initialization (e.g., as per (Dohmatob et al., 2022)), regularization, and architectural modifications (such as sparse or low-dimensional layers (Tiwari et al., 28 Jul 2025)) can improve both robustness and sample efficiency.
  • Kernel perspectives supply closed-form risk and generalization error estimates for design-time evaluation.
  • When strict feasibility or optimality is required (e.g., in parametric optimization (Bae et al., 2023)), piecewise-linear policy approximators constructed from triangulated training data and then approximated by neural networks admit formal guarantees on approximate feasibility and suboptimality.

A plausible implication is that, while linearized wide two-layer policies are favorable for tractability and sample-efficient learning in suitably structured tasks, further improvement—particularly for robustness or tasks that require higher-order feature learning—demands architectural or optimization approaches that move beyond this regime.


Key Cited Works: