
Patterning: The Dual of Interpretability

Published 20 Jan 2026 in cs.LG | (2601.13548v1)

Abstract: Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it. Our approach is based on susceptibilities, which measure how posterior expectation values of observables respond to infinitesimal shifts in the data distribution. Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration. We demonstrate patterning in a small LLM, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structure, such as the induction circuit. In a synthetic parentheses balancing task where multiple algorithms achieve perfect training accuracy, we show that patterning can select which algorithm the model learns by targeting the local learning coefficient of each solution. These results establish that the same mathematical framework used to read internal structure can be inverted to write it.

Summary

  • The paper introduces a framework that inverts the linear response of posterior observables to shifts in the data distribution (susceptibilities) in order to design targeted data interventions.
  • It leverages singular learning theory to connect local free energy, posterior concentration, and internal model complexity during training.
  • Empirical experiments on language models and synthetic tasks validate the method’s ability to control circuit formation and algorithm selection.

Patterning: The Dual of Interpretability

Conceptual Foundations

The paper introduces patterning as the mathematical dual to mechanistic interpretability. While interpretability attempts to understand how neural networks generalize via the reverse engineering of internal structure post-training, patterning addresses the inverse problem, posed before training: given a target form of generalization, what training data distribution will reliably induce it? The approach leverages susceptibilities, which quantify how posterior expectation values of specific observables (functions mapping model parameters to structural features) respond to infinitesimal changes in the data distribution. By inverting the linear response relationship encoded by the susceptibility matrix, one can construct targeted data interventions, applied either one-off or adaptively online, designed to steer the internal structure that forms during training.

This framework is formalized by the equation:

d\mu^\infty = \chi\, dh

where $d\mu^\infty$ represents the change in structural coordinates (expectation values of observables), $dh$ the infinitesimal change in data-distribution parameters (hyperparameters such as mixture weights), and $\chi$ the susceptibility matrix. The fundamental equation of patterning is the minimum-norm intervention:

dh_{\text{opt}} = \chi^\dagger\, d\mu^\infty_{\text{target}}

where $\chi^\dagger$ denotes the Moore-Penrose pseudoinverse. Therefore, patterning establishes a principled link between desired generalization and the data distribution required to elicit it.
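The minimum-norm intervention can be sketched numerically. This is an illustrative example with invented shapes and values, not the paper's implementation: a small susceptibility matrix maps four mixture weights to two structural observables, and the pseudoinverse yields the smallest data intervention reproducing a target structural shift.

```python
import numpy as np

# Hypothetical sketch of the patterning equation dh_opt = chi^+ dmu_target.
# chi maps 4 data-mixture weights to 2 structural observables; values are synthetic.
rng = np.random.default_rng(0)
chi = rng.normal(size=(2, 4))        # susceptibility matrix (observables x hyperparameters)
dmu_target = np.array([0.5, -0.2])   # desired shift in structural coordinates

# Moore-Penrose pseudoinverse gives the minimum-norm data intervention.
dh_opt = np.linalg.pinv(chi) @ dmu_target

# Linear response check: the intervention reproduces the target shift
# (exactly here, since chi has full row rank).
assert np.allclose(chi @ dh_opt, dmu_target)
```

Among all interventions $dh$ satisfying $\chi\, dh = d\mu^\infty_{\text{target}}$, the pseudoinverse solution has minimum Euclidean norm, matching the paper's characterization of $dh_{\text{opt}}$.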

Singular Learning Theory and Susceptibility Analysis

Central to the methodology is singular learning theory (SLT) [watanabe2009algebraic], which provides the local learning coefficient (LLC, $\lambda$) as a measure of solution complexity, dictating posterior preference among multiple zero-loss (empirical risk minimization) solutions. SLT describes the asymptotic concentration of the posterior around parameter minima and relates the number of compressible directions (degeneracies) in parameter space to structural features. SLT also gives an asymptotic expansion for the local free energy:

F_n(\mathcal{U}) = n L_n(w^*) + \lambda(w^*)\log n + \cdots

where $L_n(w^*)$ is the local empirical loss and $\lambda(w^*)$ encodes the effective dimension.
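A toy calculation (not from the paper) makes the selection mechanism concrete: when two zero-loss solutions tie on empirical loss, the free-energy expansion favors the one with the lower LLC, and the posterior concentrates there.

```python
import math

# Toy illustration of the SLT local free energy F_n = n*L_n + lambda*log n
# (higher-order terms dropped). The loss and LLC values are invented.
def local_free_energy(n, loss, llc):
    return n * loss + llc * math.log(n)

n = 10_000
F_simple  = local_free_energy(n, loss=0.0, llc=2.5)  # lower-complexity solution
F_complex = local_free_energy(n, loss=0.0, llc=7.0)  # higher-complexity solution

# With equal loss, the lower-LLC solution has lower free energy, so the
# posterior prefers it; this is the lever that patterning targets.
assert F_simple < F_complex
```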

Susceptibilities are defined as derivatives of posterior expectation values with respect to the data distribution. In practice, for a model such as a transformer, per-token susceptibilities $\chi^C_{xy}$ are estimated to capture the sensitivity of components (e.g., attention heads) to token-context pairs. Susceptibility matrices admit singular value decomposition (SVD), coupling principal data patterns (right singular vectors in data space) to principal structural directions in observable space.

Figure 1: PC2 susceptibilities highlight recurrent patterns in the language modeling corpus, clustering rare biological terms and induction motifs in token sequences.
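The SVD step can be sketched as follows; shapes and values are invented for illustration, with rows standing in for model components and columns for token-context pairs.

```python
import numpy as np

# Illustrative SVD of a per-token susceptibility matrix: left singular
# vectors span structural directions (over components), right singular
# vectors span data directions (over token-context pairs).
rng = np.random.default_rng(1)
n_components, n_tokens = 8, 100      # e.g. attention heads x token-context pairs
chi = rng.normal(size=(n_components, n_tokens))

U, S, Vt = np.linalg.svd(chi, full_matrices=False)
v2 = Vt[1]                           # second principal data direction (PC2)

# Each singular value couples one data pattern to one structural direction;
# the leading pair gives the best rank-1 summary of the response.
chi_rank1 = S[0] * np.outer(U[:, 0], Vt[0])
assert chi_rank1.shape == chi.shape
```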

Empirical Validation: Induction Circuit Formation

Patterning is first validated in a language modeling setting using a 3M-parameter attention-only transformer. Following prior work, the second principal component (PC2) of the susceptibility matrix is shown to couple data induction patterns with the induction circuit forming in the model weights. By reweighting training tokens along $v_2$ (the right singular vector from the SVD), circuit formation can be reliably modulated:

  • Down-weighting tokens with negative $v_2$ values delays or prevents induction circuit formation.
  • Up-weighting these tokens accelerates the emergence and strengthens the induction circuit.

This prediction is borne out by both susceptibility measurements and extrinsic functional metrics (e.g., prefix matching scores, previous-token scores following [olsson2022context]).

Figure 2: Prefix matching and previous-token scores trace the functional induction circuit formation in the LLM for different token reweighting schedules.
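A minimal sketch of token reweighting along $v_2$, assuming a simple linear weighting scheme; the intervention strength and functional form here are assumptions, not the paper's exact schedule.

```python
import numpy as np

# Hedged sketch: turn per-token PC2 scores into non-negative sampling
# weights, so high-scoring tokens are up-weighted and negative-scoring
# tokens are down-weighted. Scores are synthetic stand-ins.
rng = np.random.default_rng(2)
v2_scores = rng.normal(size=1000)    # stand-in for per-token v_2 projections
alpha = 0.5                          # intervention strength (hypothetical)

weights = np.clip(1.0 + alpha * v2_scores, 0.0, None)  # clip at zero
weights /= weights.sum()                               # normalize to a distribution

# Result is a valid reweighted sampling distribution over training tokens.
assert weights.min() >= 0.0
```

Sampling training batches from `weights` rather than uniformly implements the re-weighting; flipping the sign of `alpha` switches between accelerating and delaying circuit formation.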

Algorithm Selection in Parenthesis Balancing Task

The second experiment applies patterning to a synthetic classification task involving balancing parentheses, where two algorithms—Nested and Equal-Count—are consistent with training but generalize differently out-of-distribution (OOD). SLT predicts that the posterior will prefer the algorithm with lower LLC. Susceptibility measurements local to each solution allow the construction of a susceptibility gap:

\Delta\chi_x = \chi^{\text{EQ}}_x - \chi^{\text{N}}_x

where $x$ indexes training samples.
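Computing the gap and selecting samples for enrichment can be sketched as below; the susceptibility values and selection rule (top-$k$ by gap) are illustrative assumptions.

```python
import numpy as np

# Illustrative susceptibility gap Delta_chi_x = chi^EQ_x - chi^N_x over
# training samples, followed by selection of the highest-gap samples for
# enrichment. All values are synthetic.
rng = np.random.default_rng(3)
chi_eq = rng.normal(size=500)        # susceptibilities local to the Equal-Count solution
chi_n  = rng.normal(size=500)        # susceptibilities local to the Nested solution

gap = chi_eq - chi_n
top_k = 50
enrich_idx = np.argsort(gap)[-top_k:]  # samples that most favor Equal-Count

# Enriching the training set with these samples shifts the relative LLCs,
# and hence the posterior preference between the two algorithms.
assert len(enrich_idx) == top_k
```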

Synthetic augmentation of training data:

  • Enriching for "almost nested" samples (which are hard for the Nested algorithm) raises its LLC, favoring Equal-Count solutions.
  • Enriching for "almost equal" samples (selected for high susceptibility gap) shifts complexity and empirical OOD accuracy in the predicted direction.

Training on distributions enriched in "almost nested" samples produced models overwhelmingly employing the Equal-Count strategy (mean OOD accuracy $0.004$), while enriching for "almost equal" samples increased the prevalence of Nested solutions (mean OOD accuracy $0.497$, up from $0.310$ on the original distribution).

Figure 3: Heatmaps illustrate Dyck path crossing patterns in synthetically generated sequences, distinguishing "almost nested" and "almost equal" samples that drive selection between learning algorithms.

Figure 4: LLCs of trained models on original and modified data distributions, showing selective complexity elevation induced by sample reweighting.

Broader Implications and Alignment Applications

The susceptibility-guided patterning paradigm yields multiple implications:

  • Internal selection: Posterior preferences are shaped by the complexity landscape; data interventions enable the steering of internal structure formation.
  • Alignment: Patterning offers principled methods to avoid specification gaming and instrumental convergence by targeting the complexity of unwanted structures.
  • Generalization Control: Empirical and theoretical analyses suggest that precise control over computational motifs, such as circuits or algorithm selection, is feasible via targeted data curation.
  • Scaling Considerations: Computational cost is dominated by forward passes for susceptibility estimation, but methodology scales favorably with model size, and approaches such as online adaptive patterning or surrogate modeling may reduce cost further.

Relation to Existing Work

The approach mathematically formalizes and extends several paradigms:

  • Mechanistic interpretability [olah2020zoom, elhage2021mathematical, wang2023interpretability]: The susceptibility-based analysis captures and generalizes circuit decomposition, directly linking model internals to data perturbations.
  • Influence functions [cook1980characterizations, koh2017understanding, kreer2025bayesianinfluencefunctionshessianfree]: While both assess data impact, patterning focuses on internal complexity measures, not prediction error.
  • Data curation: Existing works optimize downstream performance; patterning targets internal structure regardless of downstream metric.
  • Developmental analogies: Morphogen gradients and bifurcation theory from biology metaphorically underpin the susceptibility-driven modulation of model development [waddington1957strategy, wang2025embryologylanguagemodel].
  • Coherent control: Analogous to spectroscopy and femtochemistry, low-intensity susceptibility measurements are inverted to drive large-scale structural changes in neural networks [brumer1989coherence, shapiro2012quantum].

Conclusion

Patterning operationalizes the dual to interpretability, providing explicit equations and algorithms to induce desired generalization by training data design. Experimental evidence demonstrates that susceptibility-guided interventions produce predictable, measurable changes in the internal structure and functional behavior of neural networks. The methodology is grounded in singular learning theory and leverages mechanistic observables to achieve targeted structural control. Theoretical and empirical limitations remain, particularly regarding scaling, online adaptation, and robustness beyond linear response. Nevertheless, patterning advances the landscape of AI generalization control, with substantial implications for alignment and principled neural engineering.
