Deep Neural Feature Ansatz (NFA)
- Deep Neural Feature Ansatz (NFA) is a unified hypothesis that explains feature learning in deep neural networks by aligning layer weight covariances (Gram matrices) with gradient-based metrics (AGOP).
- Empirically, the correlation between layer Gram matrices and the AGOP rises sharply during training in both fully connected and convolutional architectures, with consequences for network design and initialization.
- NFA bridges theoretical insights with practical applications, enhancing kernel methods, interpretability, and optimizer design through empirically validated alignment principles.
The Deep Neural Feature Ansatz (NFA) posits a unified mechanism underlying feature learning in deep, overparameterized neural networks, both fully connected and convolutional. According to the NFA, the feature geometry learned within each layer, quantified by the Gram (or covariance) matrix of its weights, becomes highly correlated with, and often proportional to, the average gradient outer product (AGOP) of the network output with respect to the relevant feature vector (entire input, local patch, or hidden activation) at that layer. This proportionality arises during gradient-based training, empirically holds across a wide range of architectures and tasks, and has both theoretical and algorithmic consequences for network design, kernel methods, and interpretability (Radhakrishnan et al., 2022, Beaglehole et al., 2023, Beaglehole et al., 2024, Tansley et al., 17 Oct 2025, Wiatowski et al., 2016).
1. Formal Statement and Definitions
The NFA asserts, for a network layer with weight matrix $W$ and "features" $x$ (the input or hidden activations feeding that layer), that the feature Gram matrix $G = W^\top W$ aligns with the AGOP evaluated over a dataset $\{x_j\}_{j=1}^n$:

$$W^\top W \;\propto\; \frac{1}{n}\sum_{j=1}^{n} \nabla_x f(x_j)\,\nabla_x f(x_j)^\top.$$
In convolutional networks, the AGOP is typically formed by averaging over the spatial patches of the feature maps (Beaglehole et al., 2023). In fully connected and deep linear networks, it reduces to an expected gradient outer product (EGOP) over input coordinates or hidden activations (Radhakrishnan et al., 2022, Tansley et al., 17 Oct 2025).
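As a concrete illustration of these definitions, the following sketch computes the layer-1 feature Gram matrix and the AGOP for a small two-layer ReLU network with analytically computed input gradients (the network, its sizes, and the data here are illustrative, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 6, 32, 500
X = rng.standard_normal((n, d))                 # dataset {x_j}
W1 = rng.standard_normal((width, d)) / np.sqrt(d)
w2 = rng.standard_normal(width) / np.sqrt(width)

# f(x) = w2 . relu(W1 x); grad_x f(x) = W1^T (w2 * relu'(W1 x)).
act = (X @ W1.T > 0).astype(float)              # relu'(W1 x_j), one row per sample
J = (act * w2) @ W1                             # row j = grad_x f(x_j)

G_layer1 = W1.T @ W1                            # feature Gram matrix of layer 1
AGOP = J.T @ J / n                              # average gradient outer product

print(G_layer1.shape, AGOP.shape)               # both d x d
```

Both matrices are $d \times d$, symmetric, and positive semidefinite; the NFA is a statement about their entrywise correlation after training.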
The ansatz often holds up to a scalar multiple or an exponent. For instance, in deep linear networks of depth $L$ with balanced initialization, it is shown that

$$G_1 \;\propto\; \big(\mathrm{AGOP}(f)\big)^{1/L},$$

where $G_1 = W_1^\top W_1$ is the Gram of the first layer and $\mathrm{AGOP}(f) = M^\top M$ for the end-to-end linear map $M = W_L \cdots W_1$ (Tansley et al., 17 Oct 2025). In convolutional architectures, the filter Gram matches the AGOP aggregated over patches in the spatial dimensions (Beaglehole et al., 2023). The proportionality constant(s) depend on architecture, depth, learning rate, and initialization.
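In the linear case this proportionality is an exact matrix identity at any balanced configuration, which can be checked numerically. The sketch below builds balanced factors from hypothetical orthogonal matrices (a construction chosen for illustration, not taken from the paper) and verifies $W_1^\top W_1 = (M^\top M)^{1/L}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 5, 4

# Balanced factors: W_i = Q_{i+1} S^(1/L) Q_i^T with orthogonal Q_i,
# so that W_{i+1}^T W_{i+1} = W_i W_i^T (the balancedness condition).
S = rng.uniform(0.5, 2.0, d)                       # singular values of the end-to-end map
Q = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(L + 1)]
W = [Q[i + 1] @ np.diag(S ** (1 / L)) @ Q[i].T for i in range(L)]

M = np.eye(d)
for Wi in W:
    M = Wi @ M                                     # end-to-end map M = W_L ... W_1

# For linear f(x) = Mx, the AGOP is M^T M; take its L-th matrix root
# via eigendecomposition (M^T M is symmetric positive definite).
evals, evecs = np.linalg.eigh(M.T @ M)
agop_root = evecs @ np.diag(evals ** (1 / L)) @ evecs.T

print(np.allclose(W[0].T @ W[0], agop_root))       # True: G_1 = (AGOP)^(1/L)
```

The identity follows because the orthogonal factors telescope in the product, leaving $W_1^\top W_1 = Q_0 S^{2/L} Q_0^\top = (M^\top M)^{1/L}$.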
2. Theoretical Foundations and Proof Strategies
Several classes of models admit rigorous proofs of the NFA:
- In deep linear networks under gradient flow, if balanced initialization is maintained, the Gram matrix of layer 1 aligns to the $1/L$th power of the network's overall sensitivity (i.e., the AGOP). This is established via matrix power identities: the recursive structure of deep linear models forces $W_1^\top W_1 = (M^\top M)^{1/L}$, where $M = W_L \cdots W_1$ is the linear map realized by the full network (Tansley et al., 17 Oct 2025).
- In certain scenarios (e.g., one-step gradient descent with zero initialization), the filter Gram in convolutional settings matches the AGOP exactly, as shown by direct expansion of the gradient update. Taylor expansion and moment matching arguments extend this to early multi-step SGD (Beaglehole et al., 2023).
- For high-dimensional nonlinear networks, the alignment emerges via SGD-induced coupling between the left singular vectors of the layer's weight matrix and the tangent feature kernel of the pre-activations. As widths grow, these alignments become almost surely perfect (Beaglehole et al., 2024).
- For general convolutional architectures satisfying frame-like and Lipschitz conditions on their filters, nonlinearities, and pooling operators, discrete deep feature extraction theory establishes global and per-layer stability properties. This formalism justifies a general neural feature ansatz on functional, not matrix, grounds (Wiatowski et al., 2016).
Critically, the ansatz can fail for nonlinear architectures: explicit counterexamples show that there is no universal exponent $\alpha$ for which $W_1^\top W_1 \propto (\mathrm{AGOP}(f))^{\alpha}$ holds in networks with nonlinearities, even when exact data fitting is achieved (Tansley et al., 17 Oct 2025).
3. Empirical Evidence and Correlational Behavior
Empirical studies confirm that, during training:
- At initialization, the correlation between the feature Gram matrix and the AGOP is near zero.
- During and after training, this correlation typically rises rapidly and stabilizes near 1.
- In modern convolutional networks (AlexNet, VGG, ResNet), the entrywise correlation between learned filter covariances and the AGOP in each convolutional layer exceeds 0.9—while correlations between initial and final covariances remain much lower (<0.3) (Beaglehole et al., 2023).
- Nonlinear fully connected networks exhibit the same pattern: feature matrix–AGOP alignment arises spontaneously under SGD for a wide range of architectures, optimizers, and initializations (Radhakrishnan et al., 2022, Beaglehole et al., 2024, Tansley et al., 17 Oct 2025). In deep linear networks, the exponent $1/L$ in $G_1 \propto (\mathrm{AGOP})^{1/L}$ is verified across depths $L$ up to $5$ (Tansley et al., 17 Oct 2025).
Visualization of Gram matrices and AGOPs in convolutional models shows the emergence of structured features (e.g., Gabor-like edge detectors; consistent spectral structure), even for very deep or highly engineered networks (Beaglehole et al., 2023).
4. Algorithmic Consequences and Practical Applications
The NFA provides a concrete foundation for both theoretical understanding and practical advances:
- Feature learning in kernel machines: The NFA can be used to augment kernel methods with data-adaptive feature learning via the AGOP. The Recursive Feature Machine (RFM) alternates kernel regression fits with AGOP-based metric updates, upgrading classical kernels to match learned networks on tabular tasks (Radhakrishnan et al., 2022). In convolutional settings, the (Deep) Convolutional Recursive Feature Machine (ConvRFM) utilizes patchwise AGOPs to adapt convolutional kernels, attaining generalization on par with deep CNNs (Beaglehole et al., 2023).
- Initialization and architecture design: Initializing filters with covariances informed by the AGOP, rather than isotropic random draws, can accelerate early training, especially in convolutional networks (Beaglehole et al., 2023).
- Interpretability: AGOP computation on trained models illuminates the specific patch or activation directions to which the model is most sensitive, extending classical feature visualizations to arbitrary layers and architectures (Beaglehole et al., 2023, Radhakrishnan et al., 2022).
- Practical optimizers: New layerwise update rules (e.g., "speed-layer optimizer") designed to maximize the change per SGD step enhance NFA alignment and empirically improve feature quality (Beaglehole et al., 2024).
- Explaining deep learning phenomena: The NFA accounts for the emergence of simplicity/spurious feature biases, the lottery ticket effect (sparsity and performance upon pruning), and phase transitions in generalization ("grokking") (Radhakrishnan et al., 2022).
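The RFM loop mentioned above can be sketched as follows. This is a minimal regression variant assuming a Gaussian kernel with a learned Mahalanobis metric $M$ (the papers use Laplace kernels); the function name `rfm`, the bandwidth `bw`, and the trace normalization are illustrative choices, not the published algorithm:

```python
import numpy as np

def rfm(X, y, iters=3, reg=1e-3, bw=2.0):
    """Minimal Recursive Feature Machine sketch: alternate kernel ridge
    regression with an AGOP-based update of the feature metric M."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(iters):
        # Mahalanobis Gaussian kernel: K_ij = exp(-||x_i - x_j||_M^2 / (2 bw^2))
        XM = X @ M
        sq = np.einsum('ij,ij->i', XM, X)
        D = np.maximum(sq[:, None] + sq[None, :] - 2 * XM @ X.T, 0.0)
        K = np.exp(-D / (2 * bw**2))
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # kernel ridge fit
        # AGOP of the fitted predictor f(x) = sum_j alpha_j k_M(x, x_j),
        # using grad_x k_M(x, x_j) = -k_M(x, x_j) M (x - x_j) / bw^2.
        G = np.zeros((d, d))
        for i in range(n):
            diffs = X[i] - X                               # n x d
            g = -(K[i] * alpha) @ (diffs @ M) / bw**2      # grad_x f at x_i
            G += np.outer(g, g)
        M = G / n
        M /= np.trace(M) / d                               # keep overall scale fixed
    return M, alpha

# Low-rank target: y depends only on the first coordinate.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X[:, 0] ** 2
M, _ = rfm(X, y)
print(np.round(np.diag(M), 2))   # metric mass concentrates on coordinate 0
```

On this planted example the learned metric downweights the seven irrelevant coordinates, mirroring how the AGOP of a trained network concentrates on task-relevant directions.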
5. Comparative Table of NFA Manifestations
| Context | NFA Statement | Proven/Empirical |
|---|---|---|
| Deep linear networks | $W_1^\top W_1 \propto (\mathrm{AGOP})^{1/L}$ | Theoretical (any $L$, balanced init) (Tansley et al., 17 Oct 2025) |
| Fully connected nets | $W_i^\top W_i \propto \mathrm{AGOP}_i$ per layer | Empirical (early SGD, wide nets) (Radhakrishnan et al., 2022, Beaglehole et al., 2024) |
| CNNs (conv layers) | Filter Gram $\propto$ patchwise AGOP | Empirical (ImageNet nets) (Beaglehole et al., 2023) |
| Kernel methods (RFM) | Feature metric set by AGOP update | SOTA tabular performance (Radhakrishnan et al., 2022) |
| Nonlinear networks | No universal exponent $\alpha$ | Fails in some ReLU/oscillatory cases (Tansley et al., 17 Oct 2025) |
6. Limitations, Open Problems, and Theoretical Boundaries
The NFA is not universally valid. For architectures with nontrivial nonlinearities, explicit counterexamples demonstrate that feature Gram–AGOP alignment does not always occur, even with perfect fitting (Tansley et al., 17 Oct 2025). The dependence of alignment strength on finite width, mini-batch stochasticity, optimizer choice, and learning schedules is not yet fully characterized.
In linear cases, the depth-dependent exponent $1/L$ introduces a trade-off: shallow networks focus feature learning at earlier layers, while deeper networks distribute representation over more layers, potentially affecting the sample complexity and interpretability of learned features (Tansley et al., 17 Oct 2025).
A rigorous link between NFA alignment and generalization remains an open problem. High AGOP–Gram alignment is neither necessary nor sufficient for good test accuracy, particularly in the presence of spurious correlations or data structure incompatible with the network architecture.
7. Broader Connections and Theoretical Implications
The neural feature ansatz unifies empirical observations regarding feature selection, inductive bias, and signal adaptivity in deep learning. It connects kernel methods and neural networks via shared AGOP-based adaptation principles (Beaglehole et al., 2023, Radhakrishnan et al., 2022), and provides a mechanistic explanation for the success of specific design choices (e.g., small kernels, max-pooling) (Beaglehole et al., 2023).
Theoretical frameworks—ranging from frame theory in deep feature extraction (Wiatowski et al., 2016) to spectral analysis of SGD dynamics (Beaglehole et al., 2024)—highlight the centrality of first-order gradient information in shaping the representations learned by deep models. This suggests that the NFA may serve as a foundational tool for future advances in training algorithms, robust initialization, and interpretable deep networks.