
AGOP in Feature Learning

Updated 8 November 2025
  • AGOP is a statistical operator defined as the empirical mean of outer products of a model’s gradients, identifying key input directions.
  • It guides feature learning by emphasizing dominant input directions that lead to emergent structures and improved generalization.
  • AGOP underpins methods like Recursive Feature Machines and characterizes feature learning in neural networks, revealing progressive adaptation of task-relevant features.

The Average Gradient Outer Product (AGOP) is a statistical operator central to contemporary theories of feature learning and emergence in machine learning. Defined as the empirical mean of the outer products of a model’s output gradients with respect to its input, the AGOP provides a coordinate-free, data-dependent characterization of which directions in the input space are most relevant for the task. Recent work demonstrates that AGOP-driven feature learning is not specific to neural architectures or gradient-based optimization, but instead constitutes a general mechanism underlying both phase-transition phenomena like grokking and the emergence of task-adapted feature structure.

1. Mathematical Definition and Interpretation

AGOP is formally defined for a predictor $f:\mathbb{R}^d \to \mathbb{R}^c$ and a dataset $\{x^{(j)}\}_{j=1}^n$ as:

$$\mathrm{AGOP}\big(f; \{x^{(j)}\}_{j=1}^{n}\big) = \frac{1}{n} \sum_{j=1}^n \frac{\partial f(x^{(j)})}{\partial x} \left(\frac{\partial f(x^{(j)})}{\partial x}\right)^{T} \in \mathbb{R}^{d \times d}$$

Here, $\frac{\partial f(x)}{\partial x}$ is the $d \times c$ Jacobian of the predictor. The AGOP encapsulates the sensitivity of all model outputs to infinitesimal changes in each input direction, and its top eigenvectors identify the input directions (“features”) to which the model’s predictions are most responsive. This makes AGOP a fundamental object for quantifying and tracking feature learning.
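
The following numpy sketch makes the definition concrete for a toy two-layer predictor with an analytic Jacobian; the architecture, sizes, and Jacobian formula are illustrative assumptions, not details from the source.

```python
import numpy as np

def agop(jacobians):
    """Average Gradient Outer Product from a list of d x c Jacobians.

    Each Jacobian J has shape (d, c): column k holds the gradient of
    output k with respect to the input.  AGOP = mean_j J_j @ J_j.T.
    """
    d = jacobians[0].shape[0]
    M = np.zeros((d, d))
    for J in jacobians:
        M += J @ J.T
    return M / len(jacobians)

# Toy predictor f(x) = W2 @ tanh(W1 @ x), chosen only for illustration.
rng = np.random.default_rng(0)
d, h, c, n = 10, 32, 3, 200
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(c, h))
X = rng.normal(size=(n, d))

def jacobian(x):
    # df/dx has shape (c, d); return its transpose (d, c) to match
    # the convention used in the text.
    pre = W1 @ x
    return (W2 @ np.diag(1 - np.tanh(pre) ** 2) @ W1).T

M = agop([jacobian(x) for x in X])
eigvals, eigvecs = np.linalg.eigh(M)
print("top AGOP eigenvalue:", eigvals[-1])  # eigvecs[:, -1] is the dominant input direction
```

The top eigenvectors of `M` are exactly the directions in input space to which the predictor's outputs are most sensitive.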

2. AGOP as the Driver of Feature Learning

2.1. Recursive Feature Machines (RFM)

RFM is an iterative algorithm that introduces task-driven feature learning into models lacking adaptive features, such as kernel machines. The update cycle is as follows (a minimal sketch follows the list):

  1. Fit a predictor $f^{(t)}$ in a feature space parameterized by $M_t$.
  2. Compute $M_{t+1} = \mathrm{AGOP}\big(f^{(t)}; \{x^{(j)}\}_{j=1}^{n}\big)$.
  3. Update the feature map by transforming inputs via $x \mapsto M_{t+1}^{s/2}\, x$ (for a chosen $s > 0$), thus reweighting and amplifying the directions the model’s output depends on most.
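
A minimal sketch of this cycle, assuming single-output kernel ridge regression with a Gaussian kernel on a Mahalanobis metric (published RFM implementations typically use a Laplace kernel, so these specifics are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Z, M, bandwidth=1.0):
    """Gaussian kernel on the Mahalanobis distance induced by M."""
    XM, ZM = X @ M, Z @ M
    sq = (XM * X).sum(1)[:, None] + (ZM * Z).sum(1)[None, :] - 2 * X @ M @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth ** 2))

def rfm(X, y, steps=5, reg=1e-3, bandwidth=1.0):
    """Minimal Recursive Feature Machine sketch: kernel ridge regression + AGOP."""
    n, d = X.shape
    M = np.eye(d)                                  # M_0: start with no feature reweighting
    for _ in range(steps):
        # 1. Fit the predictor in the feature space defined by the current M.
        K = gaussian_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)
        # 2. AGOP of the fitted predictor f(x) = sum_j alpha_j k(x, x_j),
        #    using the analytic gradient of the Gaussian kernel.
        M_new = np.zeros((d, d))
        for i in range(n):
            diffs = (X[i] - X) @ M                 # rows: M (x_i - x_j)
            grad = -(K[i][:, None] * diffs / bandwidth ** 2).T @ alpha  # df(x_i)/dx
            M_new += np.outer(grad, grad)
        # 3. Use M = AGOP as the new metric; equivalent to x -> M^{1/2} x (s = 1).
        M = M_new / n
    return M, alpha
```

Using the AGOP directly as the kernel's metric corresponds to transforming inputs by $M^{1/2}$; other exponents $s$ amount to using a matrix power of the AGOP instead.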

2.2. Neural Networks and Feature Covariance

Neural networks autonomously learn hierarchical features. As formalized in the Neural Feature Ansatz (NFA), the AGOP is empirically observed to be highly correlated with the uncentered covariance (Gram) matrix of the layer’s input weights:

$$W_1^T W_1 \propto (\mathrm{AGOP})^{s}$$

for some layerwise exponent $s$, where $W_1$ is the input weight matrix. This empirical correlation supports the interpretation of AGOP as a universal mechanism by which both neural and non-neural models adapt to amplify useful, task-relevant features.
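
One way to test the ansatz numerically is to correlate the flattened entries of $W_1^T W_1$ with those of a matrix power of the AGOP; the use of Pearson correlation and the default exponent below are assumptions chosen for illustration.

```python
import numpy as np

def nfa_correlation(W1, agop, s=0.5):
    """Pearson correlation between W1^T W1 and AGOP^s (both flattened).

    W1 has shape (hidden, d) and agop has shape (d, d); the matrix power
    is taken through the eigendecomposition of the (PSD) AGOP.
    """
    evals, evecs = np.linalg.eigh(agop)
    agop_s = evecs @ np.diag(np.clip(evals, 0.0, None) ** s) @ evecs.T
    a, b = (W1.T @ W1).ravel(), agop_s.ravel()
    return float(np.corrcoef(a, b)[0, 1])
```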

3. AGOP, Grokking, and Emergence in Modular Arithmetic

3.1. The Grokking Phenomenon

In modular arithmetic tasks (e.g., learning $(x + y) \bmod p$), models frequently experience “grokking”: a sharp, delayed phase transition in test accuracy that occurs well after the model attains near-zero training loss. Unlike classical overfitting, test performance remains flat near chance for an extended period after the training loss converges, then suddenly transitions to near-perfect accuracy.
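
As a concrete setup, the sketch below builds the modular addition dataset with the concatenated one-hot encoding commonly used in the grokking literature; the exact encoding, modulus, and split fraction are assumptions.

```python
import numpy as np

def modular_addition_data(p=61, train_frac=0.5, seed=0):
    """All pairs (x, y) in Z_p x Z_p as concatenated one-hot vectors of
    dimension 2p, labeled by (x + y) mod p, with a random train/test split."""
    pairs = np.array([(x, y) for x in range(p) for y in range(p)])
    X = np.zeros((p * p, 2 * p))
    X[np.arange(p * p), pairs[:, 0]] = 1.0       # one-hot encoding of x
    X[np.arange(p * p), p + pairs[:, 1]] = 1.0   # one-hot encoding of y
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    rng = np.random.default_rng(seed)
    idx = rng.permutation(p * p)
    cut = int(train_frac * p * p)
    return (X[idx[:cut]], labels[idx[:cut]]), (X[idx[cut:]], labels[idx[cut:]])
```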

3.2. AGOP as Progress Indicator and Mechanism

In both RFM and neural networks, grokking coincides not with changes in loss but with the emergence of pronounced structure in the AGOP: specifically, the AGOP matrix acquires block-circulant submatrices as the phase transition is approached. Empirical results show:

  • The evolution of the AGOP matrix (measured by alignment with the final generalizing AGOP, or by deviation from perfect circulant structure; sketched after this list) reveals steady progress in feature learning that is invisible to traditional metrics.
  • The phase transition in generalization is completely determined by the acquisition of the appropriate block-circulant AGOP structure.
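
A minimal sketch of the two progress metrics, assuming alignment is cosine similarity of flattened matrices and circulant deviation is the relative distance to the Frobenius-nearest circulant matrix (obtained by averaging along wrapped diagonals); the source may define them differently.

```python
import numpy as np

def agop_alignment(M_t, M_final):
    """Cosine similarity between the current and final AGOP (flattened)."""
    a, b = M_t.ravel(), M_final.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def circulant_deviation(B):
    """Relative Frobenius distance of a square block B from its nearest
    circulant matrix, built by averaging entries on each wrapped diagonal."""
    p = B.shape[0]
    i, j = np.indices((p, p))
    diag = (j - i) % p                               # wrapped-diagonal index of each entry
    means = np.array([B[diag == k].mean() for k in range(p)])
    C = means[diag]                                  # projection onto circulant matrices
    return float(np.linalg.norm(B - C) / np.linalg.norm(B))
```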

4. Block-Circulant Features and the Fourier Multiplication Algorithm

4.1. Structural Emergence

The AGOP, in both RFM and neural networks trained on modular arithmetic, consistently evolves to a block-circulant form (circulant blocks potentially after row/column reordering). Circulant matrices are closely related to the discrete Fourier basis: their eigenvectors are Fourier modes, and circulant structure enables efficient implementation of modular arithmetic via the Fourier Multiplication Algorithm (FMA).
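
A quick numerical check of this Fourier connection: the unitary DFT matrix diagonalizes any circulant matrix. The example below uses scipy and a random circulant matrix purely for illustration.

```python
import numpy as np
from scipy.linalg import circulant, dft

p = 7
C = circulant(np.random.default_rng(0).normal(size=p))  # random p x p circulant matrix
F = dft(p, scale='sqrtn')                                # unitary DFT matrix
D = F.conj().T @ C @ F                                   # conjugation by Fourier modes
off_diag = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diag)))                          # ~1e-15: D is (numerically) diagonal
```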

4.2. Exact Generalization via Circulant Features

Theoretical analysis shows that, once the AGOP is block-circulant, even a fixed kernel machine (i.e., with no further feature learning) can implement the FMA and generalize perfectly on modular arithmetic tasks. The transition to generalization is then fully explained as the model learning the right AGOP-based feature structure, independent of the particulars of the learning architecture or loss minimization dynamics.
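
The theoretical result concerns kernel machines equipped with block-circulant features; the snippet below only illustrates the FMA itself, showing that modular addition of one-hot inputs reduces to pointwise multiplication in the Fourier domain.

```python
import numpy as np

def fma_mod_add(x, y, p):
    """Compute (x + y) mod p via the Fourier Multiplication Algorithm:
    circular convolution of one-hot encodings, done by pointwise
    multiplication of their Fourier transforms."""
    ex, ey = np.zeros(p), np.zeros(p)
    ex[x], ey[y] = 1.0, 1.0
    scores = np.fft.ifft(np.fft.fft(ex) * np.fft.fft(ey)).real  # circular convolution
    return int(np.argmax(scores))                                # spikes at (x + y) mod p

p = 59
assert all(fma_mod_add(x, y, p) == (x + y) % p
           for x in range(p) for y in range(p))
```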

| Aspect | Role of AGOP |
| --- | --- |
| Definition | Mean outer product of Jacobians; identifies dominant input directions |
| Feature learning | Guides extraction of task-relevant features in RFMs and neural networks |
| Modular arithmetic | Evolves to block-circulant forms implementing Fourier-based solutions |
| Grokking | Hidden progress revealed by AGOP; emergence marked by AGOP structure |
| Theoretical support | Block-circulant AGOP features suffice for generalization via FMA |
| Empirical support | AGOP structure emerges during grokking, independent of loss/accuracy |
| General implications | Emergence traced to AGOP-driven feature learning, not architecture details |

5. Empirical and Theoretical Characterization

Empirical evidence demonstrates:

  • Grokking manifests identically in both RFM and neural networks, provided feature learning (as measured by AGOP) is present.
  • Training with pre-imposed block-circulant features or random circulant maps leads to immediate generalization, confirming that the appropriate AGOP feature structure is sufficient for success.
  • Enforcing block-circulant structure in the RFM accelerates or eliminates the grokking phase delay.
  • Hidden progress metrics (circulant deviation and AGOP alignment) evolve smoothly, in contrast to stagnant loss and accuracy metrics until the phase transition.

On the theoretical side, rigorous proofs establish that kernel machines endowed with block-circulant features can implement the FMA and achieve perfect generalization in modular arithmetic, tying the phenomenon directly and exclusively to AGOP structure.

6. Implications for Understanding Emergence in Machine Learning

The analysis of AGOP in modular arithmetic tasks leads to several general conclusions:

  • Emergent generalization (grokking) is not specific to neural architectures or gradient-based learning algorithms. Instead, it is a direct consequence of feature learning, as captured by the evolution of the AGOP.
  • AGOP provides a unifying description and analytic tool for tracking and understanding the development of generalization ability in a broad family of machine learning models.
  • The findings support the view that sharp, emergent transitions belong to the intrinsic dynamics of feature learning, measurable entirely via AGOP, and are independent of superficial details such as training loss trajectories.

7. Synthesis and Future Directions

The identification of AGOP as the governing mechanism for the emergence of generalization and grokking in modular arithmetic tasks marks a significant shift in the interpretation of model learning dynamics. Rather than attributing emergence to model class, loss dynamics, or specific optimization algorithms, the evidence indicates that feature learning—quantitatively characterized by the AGOP matrix and its evolving structure—underpins these phenomena. This insight extends the explanatory scope of AGOP beyond neural networks and gradient descent, suggesting further investigation of AGOP-driven feature learning across broader classes of machine learning architectures and tasks.
