Deep Neural Feature Ansatz Overview

Updated 17 May 2026

The Deep Neural Feature Ansatz is a framework that connects the geometry of weight matrices with network sensitivity, as measured by the average gradient outer product (AGOP).
It establishes an algebraic proportionality between Gram matrices and AGOPs, offering insights across linear, nonlinear, and convolutional architectures.
Empirical validations and theoretical analyses demonstrate that depth, kernel adaptations, and optimization schemes critically shape spectral alignment and feature learning dynamics.

The Deep Neural Feature Ansatz (NFA) posits a precise algebraic relationship between the geometry of weight matrices in deep neural networks and the sensitivity of the network output to its inputs, as measured by average gradient outer products (AGOPs). The NFA encapsulates the principle that feature learning in deep (and potentially wide) networks can be understood by analyzing the spectra and alignment of these matrices. Formulations of the NFA range from strictly algebraic proportionalities in linear networks to broader kernel-level and field-theoretic statements in nonlinear and finite-width regimes. Variants including the Convolutional Neural Feature Ansatz (CNFA), kernel-based perspectives, and forward-backward self-consistent equations provide a unifying framework for recent advances in understanding feature adaptation in neural architectures (Tansley et al., 17 Oct 2025, Radhakrishnan et al., 2022, Beaglehole et al., 2023, Fischer et al., 2024).

1. Formal Statement and Mathematical Formulation

Let $f: \mathbb{R}^d \to \mathbb{R}$ be a neural network with first-layer weight matrix $W^{(1)} \in \mathbb{R}^{k \times d}$. The NFA asserts that after training: $G := W^{(1)\top} W^{(1)} \propto M^\alpha,$ where $M := \frac{1}{N} \sum_{i=1}^N \nabla f(x_i) \nabla f(x_i)^\top$ is the AGOP over a dataset $\{x_i\}$, and $\alpha > 0$ is a network-architecture-dependent exponent. Both $G$ and $M$ are $d \times d$ matrices. In essence, the Gram matrix $G$ encodes learned input features, while $M$ measures input sensitivity directions. The eigenvectors of $G$ are predicted to align with those of $M$, with the spectral relationship parameterized by $\alpha$. The depth dependence is sharp: in linear networks of depth $L$, $\alpha = 1/L$ (Tansley et al., 17 Oct 2025).
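The two matrices in this statement can be computed directly for a small model. The sketch below is illustrative only: the two-layer tanh network `f(x) = a^T tanh(W1 x)`, its random untrained weights, and all variable names are assumptions made here, not a setup taken from the cited papers. It builds the Gram matrix and the AGOP from analytic input gradients and measures how well their leading eigenvectors overlap.

```python
import numpy as np

def agop(grads):
    """Average gradient outer product: (1/N) * sum_i g_i g_i^T, with g_i the rows of grads."""
    return grads.T @ grads / grads.shape[0]

# Hypothetical toy network f(x) = a^T tanh(W1 x) with random (untrained) weights.
rng = np.random.default_rng(0)
d, k, n = 4, 8, 200
W1 = rng.normal(size=(k, d))
a = rng.normal(size=k)
X = rng.normal(size=(n, d))

# Analytic input gradient: grad f(x) = W1^T (a * (1 - tanh(W1 x)^2)).
H = np.tanh(X @ W1.T)             # (n, k) hidden activations
grads = ((1.0 - H**2) * a) @ W1   # (n, d); row i is grad f(x_i)

G = W1.T @ W1                     # (d, d) first-layer Gram matrix
M = agop(grads)                   # (d, d) AGOP

# Overlap of the leading eigenvectors (1.0 would mean aligned up to sign).
top_G = np.linalg.eigh(G)[1][:, -1]
top_M = np.linalg.eigh(M)[1][:, -1]
overlap = abs(float(top_G @ top_M))
```

Because the weights here are random rather than trained, `overlap` is only a baseline; the ansatz concerns what happens to this quantity after training.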

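In convolutional architectures, CNFA asserts an analogous proportionality for filter covariances and localized patch-based AGOPs: $W_{\mathrm{conv}}^\top W_{\mathrm{conv}} \propto \frac{1}{N} \sum_{i=1}^N \sum_{p} \nabla_{x_i[p]} f(x_i) \, \nabla_{x_i[p]} f(x_i)^\top,$ where the rows of $W_{\mathrm{conv}}$ are the vectorized first-layer filters, $x_i[p]$ denotes the $p$-th patch of input $x_i$, and the AGOP is computed across all spatial patches and averaged over data (Beaglehole et al., 2023).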

2. Theoretical Foundations: Linear and Nonlinear Regimes

For fully linear, $L$-layer networks initialized in a balanced regime (i.e., $W^{(\ell+1)\top} W^{(\ell+1)} = W^{(\ell)} W^{(\ell)\top}$ for $\ell = 1, \dots, L-1$), the NFA is provably exact throughout gradient flow training: $W^{(1)\top} W^{(1)} \propto M^{1/L},$ where $M$ is the output Jacobian AGOP (Tansley et al., 17 Oct 2025). This exponent generalizes the two-layer result ($\alpha = 1/2$) to arbitrary depth and is robust to training settings provided balancedness (or its asymptotic restoration via weight decay) is maintained (Tansley et al., 17 Oct 2025).
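A short derivation sketch may clarify where the exponent comes from. It assumes the network is the linear map $f(x) = W x$ with end-to-end matrix $W := W^{(L)} \cdots W^{(1)}$, and exact balancedness $W^{(\ell+1)\top} W^{(\ell+1)} = W^{(\ell)} W^{(\ell)\top}$ for every layer; the symbol $W$ is notation introduced here for illustration. Since the Jacobian of a linear map is constant, the AGOP is $M = W^\top W$, and balancedness lets adjacent factors be folded inward one layer at a time:

$$
\begin{aligned}
M = W^\top W &= W^{(1)\top} \cdots W^{(L-1)\top} \left( W^{(L)\top} W^{(L)} \right) W^{(L-1)} \cdots W^{(1)} \\
&= W^{(1)\top} \cdots W^{(L-1)\top} \left( W^{(L-1)} W^{(L-1)\top} \right) W^{(L-1)} \cdots W^{(1)} \\
&= W^{(1)\top} \cdots W^{(L-2)\top} \left( W^{(L-1)\top} W^{(L-1)} \right)^{2} W^{(L-2)} \cdots W^{(1)} \\
&= \cdots = \left( W^{(1)\top} W^{(1)} \right)^{L}.
\end{aligned}
$$

Taking the positive semidefinite $L$-th root gives $W^{(1)\top} W^{(1)} = M^{1/L}$, which matches the stated exponent; for $L = 2$ it reduces to the square-root relation of the two-layer case.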

However, counterexamples in nonlinear settings (e.g., shallow ReLU networks on simple functions) show that the strict algebraic proportionality may fail, despite perfect data interpolation, due to gradient discontinuities or symmetries in the gradient distribution (Tansley et al., 17 Oct 2025). In nonlinear, infinite-width settings, a more general layerwise ansatz holds: neural feature matrices (e.g., $W^{(\ell)\top} W^{(\ell)}$) are proportional to the AGOP at each layer, under gradient-independence approximations (Radhakrishnan et al., 2022).

For convolutional networks, a one-step gradient-descent argument establishes the CNFA in the small initialization and full-batch MSE loss regime, with empirical evidence demonstrating high correlation between learned filter covariances and patchwise AGOPs across various architectures and stages of training (Beaglehole et al., 2023).
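To make the patchwise quantities concrete, the following sketch computes a patch-based AGOP and its Pearson correlation with the filter covariance. Everything in it is an illustrative assumption rather than the setup of the cited work: a 1-D signal, a toy network `f(x) = sum_p a^T tanh(filters @ x[p:p+q])`, random untrained weights, and the helper names `extract_patches` and `patch_agop`.

```python
import numpy as np

def extract_patches(x, q):
    """All length-q sliding windows of a 1-D signal, shape (P, q) with P = len(x) - q + 1."""
    return np.stack([x[p:p + q] for p in range(len(x) - q + 1)])

def patch_agop(X, filters, a, q):
    """Patch-based AGOP of f(x) = sum_p a^T tanh(filters @ x[p:p+q]).

    Gradients are taken with respect to each patch, and the outer products are
    averaged over all patches and samples (a normalisation chosen here; it does
    not affect proportionality).
    """
    total = np.zeros((q, q))
    count = 0
    for x in X:
        patches = extract_patches(x, q)        # (P, q)
        H = np.tanh(patches @ filters.T)       # (P, c) filter responses
        grads = ((1.0 - H**2) * a) @ filters   # (P, q) gradient w.r.t. each patch
        total += grads.T @ grads
        count += grads.shape[0]
    return total / count

rng = np.random.default_rng(0)
n, length, q, c = 50, 16, 3, 6
X = rng.normal(size=(n, length))
filters = rng.normal(size=(c, q))   # c filters of width q
a = rng.normal(size=c)

filter_cov = filters.T @ filters                  # (q, q) filter covariance
M_patch = patch_agop(X, filters, a, q)            # (q, q) patch AGOP
corr = np.corrcoef(filter_cov.ravel(), M_patch.ravel())[0, 1]
```

With untrained weights `corr` carries no particular meaning; the same measurement applied to trained filters is the kind of statistic the empirical claim refers to.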

3. Depth Dependence and Feature Learning Dynamics

The architectural depth directly controls the NFA exponent ($\alpha = 1/L$). This depth scaling fundamentally modifies the structure of feature learning: shallow networks concentrate feature-magnitude more sharply onto the principal directions of output sensitivity (corresponding to large $\alpha$), while deeper networks disperse feature-magnitudes across the gradient spectrum, slowing down convergence and stabilizing spectral alignment (Tansley et al., 17 Oct 2025). Empirically, the rate at which Gram-AGOP alignment is achieved decays with depth, indicating a slow integration and diffusion of input-gradient structure.
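The concentration effect can be checked numerically on synthetic balanced linear networks. The sketch below is a constructed example under stated assumptions: square weight matrices, a hand-picked singular spectrum `sigma`, and hypothetical helpers `random_orthogonal` and `balanced_linear_net` that build exactly balanced weights rather than training them. For each depth it confirms the Gram matrix equals the AGOP raised to the power one over depth, then records the share of the trace carried by the top eigenvalue.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def balanced_linear_net(sigma, depth, rng):
    """Square weights W_1..W_L with W_{l+1}^T W_{l+1} = W_l W_l^T whose product has singular values sigma."""
    d = len(sigma)
    bases = [random_orthogonal(d, rng) for _ in range(depth + 1)]
    root = np.diag(sigma ** (1.0 / depth))
    # W_l = U_l diag(sigma^(1/L)) U_{l-1}^T
    return [bases[l + 1] @ root @ bases[l].T for l in range(depth)]

rng = np.random.default_rng(0)
sigma = np.array([4.0, 2.0, 1.0, 0.5])
top_share = {}
for depth in (2, 3, 4):
    weights = balanced_linear_net(sigma, depth, rng)
    end_to_end = np.linalg.multi_dot(weights[::-1])   # W_L ... W_1
    M = end_to_end.T @ end_to_end                     # AGOP of the map x -> end_to_end @ x
    G = weights[0].T @ weights[0]                     # first-layer Gram matrix
    evals, evecs = np.linalg.eigh(M)
    M_root = (evecs * evals ** (1.0 / depth)) @ evecs.T
    assert np.allclose(G, M_root)                     # G = M^(1/L) under exact balancedness
    top_share[depth] = np.linalg.eigvalsh(G)[-1] / np.trace(G)
```

For this particular spectrum the top-eigenvalue share falls from about 0.53 at depth 2 to about 0.39 at depth 4, a small illustration of deeper networks spreading feature magnitude across more directions.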

Furthermore, in practical training, this slow alignment is accelerated by optimization schemes that use momentum (SGD-M, Adam), although the limiting exponent remains set by the depth (Tansley et al., 17 Oct 2025). Even with unbalanced initializations, adding weight decay ensures eventual convergence to the NFA, with the rate characterized by the decay parameter.
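The role of weight decay in restoring balancedness can be seen in the smallest possible case. The sketch below is a toy under explicit assumptions: a scalar two-layer linear network `w2 * w1`, a single squared-error target, plain gradient descent, and an L2 penalty standing in for weight decay; the function name `train_scalar_net` and all hyperparameters are hypothetical. It tracks the balancedness gap `w1^2 - w2^2` from an unbalanced start, with and without decay.

```python
def train_scalar_net(w1, w2, target=1.0, lr=0.01, weight_decay=0.0, steps=5000):
    """Gradient descent on 0.5*(w2*w1 - target)^2 + 0.5*weight_decay*(w1^2 + w2^2).

    Returns the final balancedness gap w1^2 - w2^2.
    """
    for _ in range(steps):
        residual = w2 * w1 - target
        grad_w1 = residual * w2 + weight_decay * w1
        grad_w2 = residual * w1 + weight_decay * w2
        w1, w2 = w1 - lr * grad_w1, w2 - lr * grad_w2
    return w1 * w1 - w2 * w2

initial_gap = 2.0**2 - 0.1**2   # unbalanced start: w1 = 2.0, w2 = 0.1
gap_no_decay = train_scalar_net(2.0, 0.1, weight_decay=0.0)
gap_with_decay = train_scalar_net(2.0, 0.1, weight_decay=0.1)
```

Expanding one update of this rule shows that each step multiplies the gap by $(1 - \eta\lambda)^2 - \eta^2 r^2$, where $\eta$ is the learning rate, $\lambda$ the decay parameter, and $r$ the residual. Without decay the gap is therefore almost conserved, while with decay it shrinks geometrically at a rate set by $\lambda$.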

4. Kernel and Field-Theoretic Generalizations

Recent theoretical advances link the NFA to adaptations in the kernel structure of finite-width deep networks. Within a Bayesian framework, the prior over network outputs equates to a mixture over Gaussian processes with a distribution of kernels whose variance scales inversely with network width (Fischer et al., 2024). Conditioning on data, the posterior kernel adapts via a competition between the prior (resisting deviation from the infinite-width NNGP) and the log-likelihood (favoring alignment with output targets). This equilibrium produces a pair of self-consistent forward–backward equations for layerwise kernels, generalizing the NFA to data-adapted kernel evolution. The field-theoretic analysis shows that the amplitude of feature-learning corrections is governed by width-dependent kernel fluctuations and is maximized at critical initialization (“edge-of-chaos”), thus demarcating the regime where emergent feature learning departs most strongly from the lazy kernel limit (Fischer et al., 2024).

From the convolutional perspective, the AGOP-driven adaptation of Mahalanobis metrics in kernel machines recovers edge-like filter structure and can be formalized in recursive feature learning algorithms (e.g., ConvRFM) that operate without backpropagation (Beaglehole et al., 2023).

5. Algorithmic Developments and Empirical Validation

The practical instantiations of NFA-inspired learning include Recursive Feature Machines (RFM) for fully connected architectures and ConvRFM for convolutional networks (Radhakrishnan et al., 2022, Beaglehole et al., 2023). These algorithms alternate between fitting predictors and updating feature matrices or Mahalanobis metrics through AGOP computation. In kernel regression settings, iterated AGOP adaptation enables classical kernel machines to achieve performance on par with or better than gradient-trained neural networks, especially on tabular and low-rank data domains. On high-dimensional and image datasets, the Mahalanobis kernels learned by ConvRFM match the features (e.g., edge detectors) that emerge in corresponding trained convolutional neural networks (Beaglehole et al., 2023).
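The alternation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the Gaussian form of the Mahalanobis kernel, the ridge value, the bandwidth default, the trace rescaling of the metric, and the names `mahalanobis_gaussian_kernel` and `rfm_sketch` are all assumptions made here. Each iteration fits kernel ridge regression with the current metric `M`, evaluates the predictor's gradients at the training points, and replaces `M` by their AGOP.

```python
import numpy as np

def mahalanobis_gaussian_kernel(A, B, M, bandwidth):
    """K[i, j] = exp(-(a_i - b_j)^T M (a_i - b_j) / (2 * bandwidth^2)) for symmetric M."""
    AM = A @ M
    sq = (np.sum(AM * A, axis=1)[:, None]
          - 2.0 * AM @ B.T
          + np.sum((B @ M) * B, axis=1)[None, :])
    return np.exp(-np.clip(sq, 0.0, None) / (2.0 * bandwidth**2))

def rfm_sketch(X, y, iterations=3, ridge=1e-2, bandwidth=None):
    n, d = X.shape
    bandwidth = np.sqrt(d) if bandwidth is None else bandwidth
    M = np.eye(d)
    for _ in range(iterations):
        # Step 1: kernel ridge regression with the current metric M.
        K = mahalanobis_gaussian_kernel(X, X, M, bandwidth)
        coef = np.linalg.solve(K + ridge * np.eye(n), y)
        # Step 2: gradients of f(x) = sum_j coef_j K(x, x_j) at the training points:
        # grad f(x_i) = -(1 / bandwidth^2) * M @ sum_j coef_j K_ij (x_i - x_j).
        weighted = K * coef[None, :]                               # (n, n)
        diffs = weighted.sum(axis=1)[:, None] * X - weighted @ X   # (n, d)
        grads = -(diffs @ M) / bandwidth**2                        # uses symmetry of M
        # Step 3: replace M by the AGOP, rescaled to trace d (normalisation assumed here).
        M = grads.T @ grads / n
        M *= d / np.trace(M)
    return M

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0]               # target depends on the first coordinate only
M = rfm_sketch(X, y)
```

On this synthetic target the learned metric should place its largest diagonal entry on the first coordinate, a small-scale analogue of selecting the relevant low-dimensional subspace.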

Empirically, the alignment between first-layer Grams and powered AGOPs (e.g., cosine similarity peaking at $\alpha = 1/L$), the ability of RFM to select correct low-dimensional subspaces, and the emergence of interpretable edge-like features in ConvRFM collectively validate the core tenets of the NFA across architectures and data settings (Tansley et al., 17 Oct 2025, Radhakrishnan et al., 2022, Beaglehole et al., 2023).
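One way to run such an alignment check is to sweep the exponent and record the cosine similarity between the Gram matrix and each powered AGOP. The sketch below uses synthetic matrices built so that the true exponent is 1/3; the spectrum, the scale factor, the grid, and the helper names `psd_power`, `cosine_similarity`, and `best_exponent` are assumptions for illustration, not values from the cited experiments.

```python
import numpy as np

def psd_power(mat, alpha):
    """Fractional power of a symmetric PSD matrix via its eigendecomposition."""
    evals, evecs = np.linalg.eigh(mat)
    return (evecs * np.clip(evals, 0.0, None) ** alpha) @ evecs.T

def cosine_similarity(a, b):
    """Cosine of the angle between two matrices under the Frobenius inner product."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_exponent(G, M, alphas):
    """Exponent on the grid that maximises cos(G, M^alpha), plus all scores."""
    scores = np.array([cosine_similarity(G, psd_power(M, al)) for al in alphas])
    return float(alphas[np.argmax(scores)]), scores

# Synthetic pair with G proportional to M^(1/3), as for a balanced depth-3 linear network.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
M = (Q * np.array([9.0, 4.0, 1.0, 0.5, 0.1])) @ Q.T
G = 2.5 * psd_power(M, 1.0 / 3.0)

alphas = np.linspace(0.05, 1.0, 96)   # grid with spacing 0.01
alpha_hat, scores = best_exponent(G, M, alphas)
```

Cosine similarity is invariant to the proportionality constant, which is why the factor 2.5 does not move the peak; on trained networks the same sweep would be applied to the measured `G` and `M`.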

6. Limitations, Domain of Validity, and Extensions

The NFA, in its strict proportionality form, is exact only in fully linear architectures or under approximations such as infinite width and gradient independence. Nonlinearities, finite data, and stochastic/mini-batch training can induce discrepancies, with explicit counterexamples illustrating NFA failure even with zero training error (Tansley et al., 17 Oct 2025). For convolutional architectures, non-stationarity of data patches and deviations from the mean-field assumptions may reduce filter-AGOP proportionality.

Current theoretical analyses do not fully generalize the one-step or mean-field arguments to non-Euclidean data or architectures utilizing batch normalization, attention, or skip connections (Beaglehole et al., 2023). Practical implementations of kernel-based algorithms face computational bottlenecks due to quadratic scaling in AGOP computations and kernel evaluations.

Extensions to field-theoretic and kernel flow settings provide a unifying lens, positioning NFA as the first-order description of kernel adaptation dynamics at finite width and offering paths for understanding criticality and feature scale in emergent representations (Fischer et al., 2024).

In summary, the Deep Neural Feature Ansatz serves as a fundamental principle organizing the interplay between neural weight geometry and data-driven sensitivity analysis, with rigorous results in linear and certain nonlinear regimes, extensive empirical verification, and deep connections to modern theoretical treatments of neural feature learning (Tansley et al., 17 Oct 2025, Radhakrishnan et al., 2022, Beaglehole et al., 2023, Fischer et al., 2024).