
Convergent Evolution: How Different Language Models Learn Similar Number Representations

Published 22 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.20817v1)

Abstract: LLMs trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.

Summary

  • The paper demonstrates that a diverse range of language models exhibit spectral convergence in number representations via characteristic Fourier spikes.
  • It reveals that geometric convergence, which grants linear separability for modular arithmetic, is sensitive to data structure, architecture, and optimizer choices.
  • The study introduces new interpretability tools by dissecting how data statistics and model dynamics drive the emergence of functional cyclic representations.

Convergent Evolution of Number Representations in LLMs

Overview

"Convergent Evolution: How Different LLMs Learn Similar Number Representations" (2604.20817) provides a rigorous analysis of the emergence and organization of periodic number representations across a diverse set of LMs and embedding systems. The work delineates a clear two-tiered hierarchy within these representations: spectral convergence, where periodic features (Fourier spikes) arise universally, and geometric convergence, where these features yield linearly separable residue classes for modular arithmetic. The central thesis is that, although all models develop similar periodic structure (convergent evolution), the capacity for modular arithmetic is constrained by data, architecture, optimizer, and tokenization. The authors empirically and theoretically formalize this separation, providing new interpretability tools and insights with substantial implications for mechanistic interpretability and the design of foundation models.

Universality of Spectral Convergence

The authors demonstrate, via comprehensive Fourier analyses, that periodic spikes at periods $T = 2, 5, 10$ are a universal feature of number embeddings across both Transformer and non-Transformer LMs, as well as classical models such as GloVe and FastText. This universality holds even for embeddings derived from raw corpus frequency distributions, emphasizing that periodicity arises from the underlying structure of natural language data and tokenization, not merely from model architecture or training objectives.

Figure 1: The Fourier spectrum of number embeddings reveals consistent spikes at periods $T = 2, 5, 10$ across Transformer LLMs, non-Transformer LLMs, and classical embeddings, indicating convergent evolution driven by data statistics.

While this "spectral convergence" is robust, it does not guarantee any functional utility for arithmetic tasks; all systems—including those with trivial data statistics—exhibit Fourier spikes.
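
The spectral analysis reduces to a short computation. Below is a minimal sketch (our illustration, not the authors' code), assuming a NumPy array `E` of shape `(1000, d)` whose row `n` holds the embedding of the single-token number `n`:

```python
import numpy as np

def fourier_spectrum(E):
    """Median-normalized Fourier magnitude spectrum of number embeddings.
    E: (N, d) array; row n is the embedding of the number token n."""
    F = np.fft.fft(E - E.mean(axis=0), axis=0)  # DFT along the token index
    power = (np.abs(F) ** 2).sum(axis=1)        # total power per frequency bin
    return np.fft.fftfreq(E.shape[0]), power / np.median(power)

def harmonic_power(power, N, T):
    """Phi_T: total power at the nonzero harmonics l/T, l = 1..T-1.
    Assumes T divides N so every harmonic falls on an exact DFT bin."""
    assert N % T == 0
    return sum(power[l * N // T] for l in range(1, T))

# Illustrative usage: real number embeddings show spikes at T = 2, 5, 10,
# while random embeddings do not.
E = np.random.randn(1000, 64)
freqs, power = fourier_spectrum(E)
print(harmonic_power(power, N=1000, T=10))
```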

Hierarchical Structure: Spectral vs. Geometric Convergence

Through both theoretical results and empirical interrogation, the paper proves that high power at harmonic frequencies (Fourier spikes) is necessary but not sufficient for embeddings to support modular arithmetic via linear probes. While spectral convergence denotes the presence of periodic signals, geometric convergence requires that residue classes modulo $T$ be linearly separable within the embedding space.

Figure 2: Although LSTMs and raw token frequencies exhibit Fourier spikes, only specific architectures achieve high modular probe accuracy, revealing a distinction between spectral and geometric convergence.

The critical theoretical result shows that, for any fixed Fourier energy at harmonics of $T$, covariance structure within residue classes (the within-class scatter matrix) can obscure class separation, causing probe accuracy to collapse to chance even in the presence of prominent spectral features. This phenomenon is not a pathological corner case: it occurs in practical LSTM embeddings and corpus frequency spectra.

Figure 3: Examples constructed to illustrate how large Fourier power $\Phi_T$ can coexist with near-random residue class classification by controlling within-class scatter.
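
The separability criterion can be made concrete with the scatter matrices from the theorem. A minimal sketch (ours, with a small ridge term added for numerical stability) computes the Fisher discriminant $\lambda_{\max}(S_W^{-1}S_B)$ for mod-$T$ residue classes:

```python
import numpy as np

def fisher_discriminant(E, T, reg=1e-6):
    """Largest generalized eigenvalue lambda_max(S_W^{-1} S_B) for mod-T
    residue classes. E: (N, d) embeddings for number tokens 0..N-1.
    The ridge `reg` keeps S_W invertible (our addition, not the paper's)."""
    N, d = E.shape
    labels = np.arange(N) % T
    mu = E.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for r in range(T):
        X = E[labels == r]
        mu_r = X.mean(axis=0)
        S_W += (X - mu_r).T @ (X - mu_r)      # within-class scatter
        diff = (mu_r - mu)[:, None]
        S_B += len(X) * (diff @ diff.T)       # between-class scatter
    eigvals = np.linalg.eigvals(np.linalg.solve(S_W + reg * np.eye(d), S_B))
    return float(np.max(eigvals.real))

E = np.random.randn(1000, 64)
print(fisher_discriminant(E, T=10))  # near zero for unstructured embeddings
```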

Attribution to Data, Architecture, and Optimization

A major contribution is the attribution of geometric convergence to the interaction of architecture, data, and optimizer. Using controlled dataset perturbations that ablate co-occurrence structure, cross-number interaction, and context length, the authors show:

  • Spectral convergence is invariant to most data perturbations: Even when number tokens are replaced with i.i.d. samples from their marginal distribution, periodicity is preserved (a sketch of this Unigram Replace perturbation follows the list).
  • Geometric convergence is highly sensitive: Removing associations with textual context, restricting attention window size, or isolating number tokens rapidly degrades modular probe accuracy.

    Figure 4: Data perturbations preserve Fourier structure but strongly impact modular probe accuracy, indicating distinct mechanisms behind spectral and geometric convergence.

  • Architecture and optimizer effects: Transformers and linear RNNs (Gated DeltaNet, Mamba-2) achieve both spectral and geometric convergence under varied optimizers (AdamW, Muon), whereas LSTMs—even deep variants—completely fail at geometric convergence, despite robust Fourier features.

    Figure 5: All architectures yield similar spectral structure, but only select architectures and optimizers achieve strong geometric convergence.

  • Optimizer interaction: The Muon optimizer improves modular probe accuracy for Transformers and DeltaNet, but this effect is not universal (e.g., AdamW yields stronger geometric convergence for Mamba-2).
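
As referenced in the first bullet above, the Unigram Replace perturbation resamples every number token i.i.d. from its marginal distribution. A hypothetical sketch (the array-of-token-IDs representation and function name are ours):

```python
import numpy as np

def unigram_replace(corpus_tokens, number_ids, seed=0):
    """Resample every number token i.i.d. from the empirical marginal
    (unigram) distribution of number tokens, destroying co-occurrence
    structure while preserving number-token frequencies.
    corpus_tokens: 1-D int array of token IDs.
    number_ids: iterable of token IDs that are number tokens."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(corpus_tokens).copy()
    is_number = np.isin(tokens, list(number_ids))
    pool = tokens[is_number]                     # empirical marginal distribution
    tokens[is_number] = rng.choice(pool, size=is_number.sum(), replace=True)
    return tokens
```

Token frequencies are unchanged, so the Fourier spikes survive; all co-occurrence structure is gone, so modular probe accuracy collapses to chance.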

Role of Task Structure and Tokenization

The analysis is extended to models trained directly on synthetic arithmetic problems (e.g., addition). The findings are:

  • Multi-token arithmetic (e.g., 9-digit addition): Both spectral and geometric convergence emerge robustly, independent of optimizer; modular probe accuracy saturates for moduli related to the base (mod 2, 5, 10). See the carry sketch after this list.
  • Single-token arithmetic (e.g., 3-digit addition): No consistent convergence is observed; periodic structure and probe accuracy become dependent on random seed and optimizer, with models often failing to generalize modular arithmetic.

    Figure 6: Multi-token addition drives universal convergence to structured representations; single-token addition leads to unconstrained and seed-dependent results.
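
To see why multi-token tokenization creates modular subproblems, consider adding two 9-digit numbers split into 3-digit chunks: each emitted chunk is the chunkwise sum mod 1000 plus an incoming carry. A plain-Python illustration (ours, not the paper's code):

```python
a, b = 123456789, 987654321

carry, chunks = 0, []
for i in range(3):                   # 3-digit chunks, least significant first
    da = (a // 1000**i) % 1000
    db = (b // 1000**i) % 1000
    s = da + db + carry
    chunks.append(s % 1000)          # each output chunk is a mod-1000 subproblem
    carry = s // 1000                # the carry propagates to the next chunk

total = sum(c * 1000**i for i, c in enumerate(chunks)) + carry * 1000**3
assert total == a + b == 1111111110
```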

Training Dynamics

Unlike previously reported "grokking" phenomena in synthetic modular arithmetic, where structure appears suddenly, both tiers of convergence emerge smoothly and continuously during natural language pretraining, further emphasizing their reliance on statistical regularities rather than explicit algorithmic content.

Figure 7: Both Fourier power and probe accuracy increase gradually during pretraining, with no sharp phase transition.

Probing and Visualization of Learned Representations

The authors employ a mixture of probing techniques—including linear, MLP, and circular probes—to uncover the geometry of learned embeddings, confirm the clock-like organization for multi-token arithmetic, and show the absence of angular structure in the unconstrained (single-token) setting.

Figure 8: Embeddings from models trained on 9-digit addition form well-separated circular structures by mod-10 class; 3-digit addition yields no such geometry.
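
A minimal sketch of the probe evaluation, assuming number embeddings `E` with rows indexed by the number they represent; scikit-learn is our tooling choice, while the paper specifies a $T$-class logistic regression scored with Cohen's $\kappa$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

def mod_T_probe_kappa(E, T, seed=0):
    """Train a T-class linear probe to predict n mod T from embeddings E
    and report chance-corrected accuracy (Cohen's kappa) on held-out numbers."""
    y = np.arange(len(E)) % T
    X_tr, X_te, y_tr, y_te = train_test_split(
        E, y, test_size=0.2, stratify=y, random_state=seed)
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return cohen_kappa_score(y_te, probe.predict(X_te))

E = np.random.randn(1000, 64)        # stand-in embeddings
print(mod_T_probe_kappa(E, T=10))    # ~0 for unstructured embeddings
```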

Theoretical Implications and Future Directions

This work introduces a robust structural-attribution paradigm for analyzing LM representations: rather than solely interpreting model outputs or relying on instance-level attributions, one can decompose feature emergence into the contributions of data, architecture, and model dynamics. The dissociation between spectral and geometric convergence provides a new framework to prevent misinterpretation of diagnostic signals—a critical step in mechanistic interpretability. Furthermore, the analysis suggests that visible periodic signals in learned representations may substantially overestimate a model's capacity for structured reasoning about cyclic concepts.

Analogous periodic geometric features are known to emerge for other cyclic domains (days of the week, months, etc.), raising questions about the generality of the spectral-geometric hierarchy beyond numbers. The findings also suggest that architectural and data-driven pressures drive the emergence of similar representations even in the absence of explicit supervision, connecting with theoretical work on alignment and the "Platonic" organization of model features.

Conclusion

This study rigorously differentiates between statistical artifact (spectral convergence) and functional modularity (geometric convergence) in number representations in LMs. Convergent evolution of periodic features is driven fundamentally by data statistics and tokenization, while the emergence of functional, linearly decodable modular structure is contingent on a precise alignment of data, architecture, and optimizer. These results have direct implications for interpretability of neural representations, probing methodology, and the design and pretraining of future foundation models.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple question: when computers learn language, how do they learn about numbers? The authors find that many different kinds of LLMs, trained in different ways, end up representing numbers with repeating patterns tied to our base‑10 number system. Even more interesting, they show there are two levels of this “convergent evolution”:

  • Spectral convergence: models show clear repeating (periodic) patterns for numbers (like rhythms in music), especially with periods 2, 5, and 10.
  • Geometric convergence: models organize number meanings so cleanly that a simple tool can tell, just from a number’s internal representation, what its remainder is when divided by 2, 5, or 10 (that’s “mod T”).

They prove that the first level (seeing the patterns) doesn’t guarantee the second (organizing them so they’re easy to use). Then they test what data, model designs, training choices, and tokenization make the second level happen.

What questions did the authors want to answer?

In plain terms:

  1. Do different kinds of LLMs learn the same repeating patterns for numbers?
  2. Do those patterns mean the model actually understands useful number properties, like whether a number is even or what it is mod 10?
  3. What makes this happen: the data, the model type, the training method, or how we split numbers into tokens (tokenization)?
  4. If we train only on arithmetic (like addition), do the same patterns and usable structures appear?

How did they study this?

To keep things simple, here’s what their tools and tests mean:

What is a “number embedding”?

Think of each number (like “37”) getting turned into a small vector—a list of numbers—that the model uses internally, like a compact “ID card” for that token. That ID card is called an embedding.

What is a “Fourier spike” and “period T”?

  • Imagine lining up number embeddings from 0 to 999 and “listening” for rhythms in how those vectors change (like using a music analyzer). The “Fourier transform” is that analyzer: it finds repeating patterns.
  • A “spike at period T” means a strong, repeated pattern every T numbers. For example:
    • T = 2: the model picks up the even/odd rhythm.
    • T = 5 or T = 10: patterns tied to base‑10 digits and how we write and talk about numbers.

What does “mod T” mean?

“Mod T” means the remainder after dividing by T. For T = 10, the remainder is the last digit. For T = 2, it’s whether the number is even or odd.

What is a “linear probe”?

A very simple classifier (like drawing straight lines to separate groups) that tries to read a property (like “what’s this number mod 10?”) from the embeddings. If a linear probe succeeds, it means the concept is cleanly and simply organized inside the model.

What experiments did they run?

  • They checked many model types: Transformers, modern RNNs (like Mamba, Gated DeltaNet), classic LSTMs, and even older methods like word2vec and GloVe.
  • They changed the training data in controlled ways to remove specific signals (for example, keeping number frequencies but shuffling which numbers appear with which words).
  • They switched optimizers (ways of updating a model during training) like AdamW and Muon.
  • They compared normal language pretraining with training only on addition.
  • They tested different tokenizations: numbers as single tokens (like “472”) vs. multi-token numbers (splitting long numbers into chunks), which affects what subproblems the model must solve.

What did they find, and why is it important?

Here are the main results, explained simply:

  • Many systems show the same repeating patterns (spectral convergence).
    • Transformers, newer RNNs, old word embeddings—and even just the raw frequency of numbers in text—all have strong “music-like” spikes at periods 2, 5, and 10.
    • This happens because of how numbers appear in natural language and because we use base 10. It’s like different species independently evolving eyes because they all live in places where seeing is useful.
  • But seeing patterns isn’t the same as organizing them for use (geometric convergence).
    • Some models (Transformers, linear-style RNNs) organize number embeddings so well that a simple linear probe can read mod‑2, mod‑5, or mod‑10.
    • Other models (notably LSTMs) can have even bigger Fourier spikes but still fail at simple mod‑T classification. In other words: the rhythm is there, but it’s buried under messy noise that makes it unusable by simple tools.
  • They proved why spikes don’t guarantee usable structure.
    • In math terms: having big periodic power (the spikes) is necessary but not sufficient for clean, linearly separable classes. If the “within-class noise” is large in the wrong directions, the classes overlap and simple classifiers can’t separate them.
    • Intuition: Imagine you sort marbles into 10 bins (for mod‑10), but then you shake the table so much that marbles from different bins spill into each other. A noise-sensitive setup can hide the structure even if it exists on average.
  • What determines whether usable structure emerges?
    • Data signals:
      • Just keeping number frequencies (and destroying co-occurrences) still gives Fourier spikes—but mod‑T decoding collapses to chance.
      • Keeping associations between numbers and their surrounding words helps.
      • Letting multiple numbers appear together in a context also helps (numbers interacting in the same sentence).
      • Longer context windows (seeing more surrounding tokens) helps more.
    • Architecture:
      • Transformers and some linear RNNs achieve strong geometric convergence.
      • LSTMs do not, even with clear Fourier spikes.
    • Optimizer:
      • The choice (Muon vs. AdamW) matters in nuanced ways: it shifts how well geometric convergence happens, and the best choice depends on the architecture. Spectral convergence shows up either way.
  • Training on arithmetic shows a second route to convergence—and tokenization is the key.
    • Multi-token addition (long numbers split into chunks) forces the model to solve “carry” subproblems digit by digit. That naturally creates mod‑1000 and related mod‑T structures, so spikes appear and mod‑T is easy to decode—no matter the optimizer.
    • Single-token addition (each number is one token) doesn’t force any modular substructure. There’s no pressure to build those circular/repeating features, so sometimes the model learns them, sometimes not. Results vary by optimizer and random seed.
    • Bottom line: if the task demands modular thinking (because of how numbers are split), the model converges on usable, shared representations. If not, it may not.

Why does this matter?

  • It warns us: seeing a pattern in model embeddings doesn’t mean the model can use it simply. Pretty pictures of structure can be misleading.
  • It gives a checklist for building better number understanding:
    • Use data that connects numbers with text and with other numbers, and give the model enough context.
    • Choose architectures (like Transformers or linear-style RNNs) that tend to produce clean, usable organization.
    • Use tokenization and tasks that force modular subproblems (like multi-token numbers in arithmetic), which naturally shape the right internal structures.
  • It suggests a general lesson beyond numbers: for any concept (like days of the week, months, or other cycles), we should distinguish “spectral” signs of a pattern from “geometric” organization that makes it functionally usable. That helps us tell superficial learning from real, actionable understanding.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of unresolved issues that limit the paper’s claims or open directions for follow-up work:

  • Missing sufficient conditions: While the paper proves Fourier-domain power at period T is necessary but not sufficient for mod-T linear separability, it does not provide constructive or verifiable sufficient conditions (in terms of data distribution, architecture, or training dynamics) that guarantee geometric separability.
  • Theory-to-training link: No theoretical bridge predicts when specific data/optimizer/architecture choices will yield favorable within-class scatter (low cond(S_W)) so that spectral signals translate into strong λ_max(S_W^{-1} S_B) and high probing accuracy.
  • Origin of T ∈ {2, 5, 10} spikes: The paper documents universality of these spectral peaks but does not causally explain why natural corpora (and tokenization) produce this periodic structure; a corpus-level, cross-domain, and cross-lingual analysis is missing.
  • Tokenizer generality: Results hinge on the Llama-3 tokenizer (single tokens for 0–999). It remains unknown how conclusions change for byte-level tokenizers, different BPE/unigram vocabularies, digit-level tokenizations, or languages with distinct numeral morphology (e.g., Chinese, Arabic-Indic), and for spelled-out numbers.
  • Ordering assumption in Fourier analysis: The DFT is computed over numbers ordered by numeric value. Robustness to token ID permutations, missing numbers, or non-contiguous numeric vocabularies is not evaluated.
  • Scale dependence: Controlled experiments use ~300M-parameter models; the scaling behavior (parameters, data size) of spectral vs geometric convergence, including thresholds and scaling laws for κ vs tokens/params, is not characterized.
  • Architecture-specific mechanism: LSTMs show strong spectral spikes but fail probing; the paper does not mechanistically explain why (e.g., gate dynamics, representational anisotropy sources, depth/width effects) or whether architecture tweaks (LN-LSTMs, attention-augmented LSTMs) could remedy this.
  • Optimizer implicit bias: The optimizer dependence (Muon vs AdamW) is documented but not explained; a systematic study over hyperparameters (learning rate schedules, weight decay, norm constraints, clipping, batch size) and their interaction with architecture is missing.
  • Layer-wise representations: Analyses focus on token embeddings; it is unknown how mod-T features evolve across layers, whether geometric separability emerges early or late, and whether deeper activations exhibit stronger/nonlinear decodability than embeddings.
  • Nonlinear structure: While linear probes are primary, the extent to which moduli (e.g., 3, 7, 9) are nonlinearly decodable (kernel/MLP/circular probes) is not comprehensively mapped, nor are the representational geometries for these harder moduli analyzed.
  • Distinguishing frequency artifacts from learned structure: Beyond showing that unigram frequencies can induce spikes, the paper does not provide a general diagnostic pipeline (e.g., reweighting/whitening, control spectra) to disentangle frequency-driven artifacts from functional features in practice.
  • Comprehensive modulus sweep: The study emphasizes T ∈ {2, 5, 10} (and 4). A systematic chart of κ over a wide range of T (including divisors of 1000 like 8, 20, 25, 40, 50, 100, 125 and non-divisors) in pretraining is missing, making it hard to separate tokenizer vs data effects.
  • Cross-number interaction granularity: The Isolate-k and context-length perturbations are coarse. More granular, synthetic controls over inter-number distance, order, and co-occurrence topology could isolate which interaction patterns most contribute to geometric convergence.
  • Cross-lingual/multidomain robustness: Universality is claimed across some pretrained models but not systematically tested across languages, numeric conventions (thousand separators, date formats), domains (code, math corpora), or multimodal settings.
  • Downstream relevance: It remains untested whether higher κ (geometric convergence) correlates with better numeric reasoning or arithmetic performance on downstream benchmarks; functional significance beyond probe accuracy is unclear.
  • Addition with untied embeddings: The claim that tied embeddings induce modular pressure in multi-token addition is plausible but unverified; whether modular features still emerge with untied input/output embeddings or alternative output heads remains open.
  • Task/objective dependence: Only autoregressive next-token prediction is studied; whether masked LMs, sequence-to-sequence training, or other objectives alter spectral/geometric convergence is unknown.
  • Dynamics of S_W: Training dynamics show smooth co-emergence, but the evolution of the within-class scatter spectrum (eigenvalues, condition number) and its alignment with periodic signal directions over training is not measured.
  • Reconciling digit-based and Fourier-based views: Prior work finds digit-level/base-10 representations. The relationship between digit-structured features and circular/Fourier features (coexistence, transformations between them) is not established.
  • Mechanistic circuits: The paper does not connect embedding-level geometry to internal computational circuits (e.g., attention heads/MLPs implementing rotations) in the controlled pretraining runs; a causal link between geometry and mechanisms is missing.
  • Arithmetic without modular pressure: In single-token addition, optimizer/seed-sensitive spectra appear, but the paper does not explore interventions (e.g., synthetic mod constraints, curriculum, regularizers) to induce convergence or explain why some seeds grok and others do not.
  • Robustness of spectral metrics: Median-normalized Fourier power may be sensitive to analysis choices; robustness to alternative normalizations, windowing, detrending, and frequency-domain statistical tests is not reported.
  • Post-processing to recover geometry: It is unknown whether simple post-processing (e.g., whitening by S_W^{-1/2}, PCA within harmonic subspaces, metric learning) can convert “spectral-only” embeddings (e.g., LSTM) into geometrically separable ones, clarifying what training must achieve (a whitening sketch follows this list).
  • Regularization effects: The influence of explicit regularizers (orthogonality, spectral penalties, embedding norm constraints, dropout on embeddings) on cond(S_W) and geometric convergence is not explored.
  • Reproducibility and variance: While multiple seeds are used in places, comprehensive confidence intervals and variance analyses for κ and Φ_T across seeds/runs and compute settings are limited; exact scripts for data perturbations and training are not detailed in the main text.
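
The whitening question flagged above could be tested in a few lines. A sketch (ours) that whitens embeddings by S_W^{-1/2} for mod-T classes, after which the mod-T probe can be re-run to see whether "spectral-only" embeddings become separable:

```python
import numpy as np

def whiten_within_class(E, T, reg=1e-6):
    """Whiten embeddings by S_W^{-1/2}, where S_W is the within-class scatter
    for mod-T residue classes. A candidate post-processing for the open
    question above; the function name and ridge term `reg` are ours."""
    N, d = E.shape
    labels = np.arange(N) % T
    S_W = np.zeros((d, d))
    for r in range(T):
        X = E[labels == r]
        Xc = X - X.mean(axis=0)
        S_W += Xc.T @ Xc
    vals, vecs = np.linalg.eigh(S_W + reg * np.eye(d))
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric S_W^{-1/2}
    return E @ W

# Re-running a mod-T linear probe on whiten_within_class(E, T) would test
# whether, e.g., LSTM embeddings carry recoverable geometric structure.
```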

Practical Applications

Immediate Applications

The paper’s findings support several concrete actions that can be adopted now to improve model reliability with numbers, audit training data, and design better training/evaluation workflows.

  • Numeracy diagnostics for model development (Sector: software/AI) — Package Fourier-spectrum + mod-T linear-probe checks (Cohen’s κ, Fisher LDA eigenvalues) as a “Numeracy Eval Pack” integrated into training dashboards and eval harnesses; use thresholds on κ for T∈{2,5,10} as release gates; add a “Numeracy Annex” to model cards. — Assumptions/dependencies: access to number-token embeddings (or tied input/output embeddings), single-token coverage for 0–999 (or equivalent), and sufficient number-token counts.
  • Tokenization-aware prompt/data formatting to improve arithmetic reliability (Sectors: software, finance, education) — Prefer multi-token formats that segment numbers into digit groups (e.g., triads) in tasks requiring addition/carry; ensure numbers are adjacent in context to induce cross-number interaction; keep consistent base-10 formatting. — Assumptions/dependencies: control over tokenization or input formatting; tied embeddings amplify benefits; applicable with common LLM tokenizers.
  • Architecture selection guidelines for numeric tasks (Sectors: software, robotics, finance) — Choose Transformers or linear RNNs (e.g., Mamba-2, Gated DeltaNet) over LSTMs when reliable modular arithmetic features are required; use longer context windows where possible. — Assumptions/dependencies: similar parameter budgets; training on text with natural number co-occurrence; availability of long-context training/inference.
  • Optimizer choice as a lever (Sector: software/AI) — Prefer Muon for Transformers and (often) Gated DeltaNet when targeting strong mod-10 separability; validate empirically for state-space models (e.g., Mamba-2 sometimes favors AdamW). — Assumptions/dependencies: optimizer availability, stability with current infra; effect is architecture-dependent and should be validated on the target data.
  • Data pipeline “knobs” to preserve/use co-occurrence signals (Sectors: software/AI, education) — Avoid unigram replacements or aggressive de-identification that randomizes numbers; maintain text–number co-occurrence and cross-number interaction via packing strategies; ensure sufficient context length during pretraining/fine-tuning. — Assumptions/dependencies: control over corpus curation and packing; privacy constraints may conflict.
  • Arithmetic hardening via lightweight synthetic curricula (Sectors: software, finance, education) — Interleave multi-token addition exercises during fine-tuning to induce geometric convergence (linearly separable mod classes) without degrading general language ability; focus on least significant digit tasks first. — Assumptions/dependencies: tied embeddings and multi-token numbers produce strongest effects; small curricula blended with domain data are often sufficient.
  • Output guardrails using modular checks (Sectors: finance, operations, logistics) — Add parity (mod 2), last-digit (mod 10), Luhn-like (mod 10/11) checks as fast validators on generated calculations, invoice numbers, SKU checksums, and ledger summaries; prompt or tool-correct when violations are detected (a Luhn sketch follows this list). — Assumptions/dependencies: the paper shows mod-2/5/10 are easiest; these guards detect a large fraction of arithmetic slips with minimal latency.
  • Corpus auditing and anomaly detection via number-spectrum fingerprints (Sectors: policy, risk/compliance, software) — Track the Fourier spectrum of number-token frequencies to detect distribution drift, synthetic content mixtures, or data pipeline bugs; compare spectrum against historical baselines. — Assumptions/dependencies: requires enough numeric tokens; spikes at T=2,5,10 are typical in natural web-scale corpora—deviations can indicate issues but are not proof of malfeasance.
  • Evaluation of cyclical concepts beyond numbers (Sectors: scheduling/logistics, calendaring, education) — Reuse the spectral vs. geometric probing workflow to verify that models not only “see” day-of-week/month periodicities but also encode functionally separable classes; adjust prompts/formatting to explicitly include multi-token date components (DD-MM-YYYY). — Assumptions/dependencies: availability of date-time tokens and consistent formatting; generalization beyond numbers is plausible but should be validated per domain.
  • Structural attribution as an interpretability workflow (Sectors: academia, software/AI) — Adopt controlled data perturbations (e.g., Swap Numbers, Isolate-k, ContextLength-ℓ) to attribute learned numeric structure to data properties; use it alongside instance-level influence methods to guide dataset design. — Assumptions/dependencies: access to retraining or continued pretraining; impacts must be measured on held-out embeds/probes.
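
As referenced in the guardrails bullet above, a Luhn-style mod-10 checksum takes a few lines and catches most single-digit slips in generated identifiers. A standard implementation (ours):

```python
def luhn_valid(number: str) -> bool:
    """Luhn mod-10 checksum, the classic validator for card/ID numbers;
    one example of the lightweight modular guardrails described above."""
    digits = [int(c) for c in number if c.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

assert luhn_valid("79927398713")       # standard Luhn test number
assert not luhn_valid("79927398710")   # single-digit corruption is caught
```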

Long-Term Applications

These require further research, scaling studies, or engineering to productize but are directly motivated by the paper’s discoveries and theory.

  • Numeracy certification standards for AI systems (Sectors: policy/regulation, finance, healthcare) — Define standardized tests that require geometric (not just spectral) convergence for mod-T classes; mandate reporting κ and LDA-based metrics in procurement/compliance; include “Numeracy Annex” in AI assurance frameworks. — Dependencies: consensus on thresholds; sector-specific moduli (e.g., mod 10/11 for checksums) and domain-tailored tasks.
  • Tokenizer redesign for robust numerical reasoning (Sectors: software/AI) — Develop numeric-aware tokenizers that (a) guarantee multi-token segmentation for longer numbers, (b) preserve base-10 triads, and (c) coordinate with tied embeddings to encourage modular subproblems; ship as “Numeracy-First” tokenizers. — Dependencies: retraining or substantial continued pretraining; compatibility with multilingual/format-rich corpora.
  • Training objectives that explicitly shape within-class scatter (Sectors: software/AI, academia) — Add losses/regularizers that maximize Fisher discriminants (or control cond(S_W)) for target moduli, ensuring that Fourier power translates into geometric separability; extend to cyclical concepts (calendars, clocks). — Dependencies: careful optimization to avoid interfering with general capabilities; requires scalable implementation.
  • Plug-in “Fourier number embeddings” modules (Sectors: software/AI, finance, education) — Integrate explicit circular features (e.g., FONE-like Fourier number embeddings) into model inputs/heads to guarantee usable modular structure; expose a numeracy API for downstream apps (a minimal sketch follows this list). — Dependencies: interface and compatibility with existing architectures; evidence for broad transfer without regressions.
  • Certified arithmetic heads for LLM tool-use (Sectors: finance, logistics, enterprise software) — Couple LLMs with small arithmetic modules that operate on multi-token numbers using modular representations; provide formal guarantees for addition/multiplication correctness and carry handling. — Dependencies: robust routing/tool-use; verification frameworks; UX integration.
  • Forensic provenance and synthetic-text detection using numeric spectra (Sectors: policy, security) — Use characteristic mod-spectrum signatures of corpora/models as part of content provenance and detection pipelines; combine with other stylometry to identify training sources or generation artifacts. — Dependencies: large-scale benchmarks to validate sensitivity/specificity; defenses against adversarial obfuscation.
  • Curriculum design for math tutoring systems (Sectors: education) — Build tutoring workflows that present numbers in multi-token, carry-rich contexts to induce modular representations in student-facing LLMs; progressively expand context length and cross-number interactions to enhance generalization. — Dependencies: longitudinal validation of learning gains; age-appropriate formatting; alignment with educational standards.
  • Numeracy-aware data governance (Sectors: policy, enterprise data management) — Establish guidelines that preserve meaningful text–number co-occurrence and cross-number interaction in curated corpora; annotate datasets with numeric-spectrum metadata for reuse/disclosure. — Dependencies: processes to balance privacy with representational needs; tooling for automatic metadata extraction.
  • Cross-modal periodic-representation engineering (Sectors: robotics, IoT, energy) — Extend spectral–geometric design to periodic sensor/time series (e.g., diurnal cycles, maintenance intervals), ensuring models both detect periodicity and encode functionally separable phases for control/planning. — Dependencies: domain-specific tokenization/segmentation of signals; evaluation of safety-critical performance.
  • Theory-guided numeracy benchmarks and leaderboards (Sectors: academia, software/AI) — Publish open benchmarks that score models on both spectra and geometry (κ, λ_max(S_W^{-1} S_B)) across moduli, tokenizers, and architectures; track progress on moving from spectral spikes to functional separability. — Dependencies: community adoption; reproducible pipelines and public model embeddings.
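
For the plug-in circular features mentioned above, a minimal stand-in maps each integer to points on unit circles at periods 2, 5, and 10, making n mod T linearly decodable by construction (a sketch of the idea, not the FONE implementation):

```python
import numpy as np

def circular_number_features(n, periods=(2, 5, 10)):
    """Explicit circular (Fourier) features for an integer n: for each
    period T, the point (cos, sin) at angle 2*pi*(n mod T)/T on the unit
    circle. A minimal stand-in for FONE-style embeddings."""
    feats = []
    for T in periods:
        theta = 2 * np.pi * (n % T) / T
        feats += [np.cos(theta), np.sin(theta)]
    return np.array(feats)

print(circular_number_features(37))  # encodes 37 mod 2, mod 5, mod 10
```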

Each application builds on the paper’s core insights: (1) Fourier spikes (spectral convergence) are widespread but insufficient indicators of functional numeracy, (2) geometric convergence depends jointly on data, architecture, optimizer, and tokenization, and (3) multi-token arithmetic induces modular structure reliably. The feasibility of each item varies with control over tokenization, access to training loops, and sector-specific requirements for auditability and safety.

Glossary

  • AdamW: An adaptive optimizer with decoupled weight decay commonly used to train deep networks. "We vary two different optimizers for training LLMs: AdamW and Muon."
  • Anisotropic: Having direction-dependent variability; here, unequal spread of within-class variance across embedding dimensions. "the LSTM's within-class scatter is highly anisotropic"
  • Attention window: The span of tokens a model’s attention mechanism can directly attend to within a context. "no two number tokens can interact within the same attention window."
  • Autoregressive language modeling: Training objective where the model predicts the next token given previous tokens. "suggesting that autoregressive language modeling with text-number co-occurrence alone extracts richer modular structure than classical embedding methods."
  • Between-class scatter matrix: Matrix capturing variance among class means, used in discriminant analysis. "the class means μ_r, grand mean μ, between-class scatter matrix S_B, and within-class scatter matrix S_W be"
  • Carry propagation: The process by which carries from lower-order digit additions affect higher-order digits in multi-token addition. "multi-token tokenization creates modular subproblems that produce both spectral and geometric convergence through carry propagation"
  • Circular probes: Probes that project embeddings onto the unit circle to analyze circular structure. "We additionally train circular probes that project embeddings onto the unit circle"
  • Cohen's κ: A chance-corrected agreement measure used here to evaluate probe accuracy across T-way classes. "measured by Cohen's $\kappa$ of a classifier trained to predict $n \bmod T$"
  • Condition number: Ratio of largest to smallest eigenvalues of a matrix; indicates numerical stability and anisotropy. "A large condition number, $\text{cond}(S_W) = \lambda_{\max}(S_W)/\lambda_{\min}(S_W)$, drastically lowers the minimum bound."
  • Convergent evolution: Independent emergence of similar representations across different models due to shared training constraints. "We view this universality as a case of convergent evolution"
  • Cross-number interaction: Statistical dependencies arising from multiple numbers co-occurring within the same context sequence. "including text-number co-occurrence and cross-number interaction"
  • Discrete Fourier transform: A transform that decomposes a sequence into frequency components; applied along token index to detect periodicity. "we compute the discrete Fourier transform along the token index:"
  • Eigenspectrum: The set of eigenvalues of a matrix; here determines separability via $S_W^{-1}S_B$. "probe accuracy depends on the eigenspectrum of $S_W^{-1}S_B$"
  • Fisher discriminant: The maximal achievable class separation along a single linear direction in LDA. "its Fisher discriminant $\lambda_{\max}(S_W^{-1}S_B)$ is two orders of magnitude smaller."
  • Fisher's Linear Discriminant Analysis: A technique for finding linear combinations of features that best separate classes. "through the lens of Fisher's Linear Discriminant Analysis"
  • Fourier domain: Frequency space obtained by Fourier transforming embeddings across token indices. "period-$T$ spikes in the Fourier domain"
  • Fourier domain sparsity: Concentration of power at a few Fourier frequencies; indicates strong periodic components. "we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability."
  • Fourier magnitude spectrum: The power of Fourier coefficients across frequencies, often normalized to compare peaks. "the Fourier magnitude spectrum, computed as the power $\|\hat{E}_\nu\|^2$ normalized by the median across frequencies $\nu$"
  • Fourier power: Total energy at certain frequencies (e.g., harmonics of a period), used as a measure of periodic signal strength. "The LSTM has larger Fourier power $\Phi_{10}$ than the Transformer"
  • Fourier spike: A pronounced peak in the Fourier magnitude at a specific periodic frequency, indicating periodic structure. "A Fourier spike at period $T$ refers to a visible peak in $\|\hat{E}_{1/T}\|^2$"
  • Generalized eigenvalue: Eigenvalue of a matrix pair (A, B) in problems like $S_W^{-1}S_B$; measures discriminative signal relative to noise. "exactly the largest generalized eigenvalue of the scatter matrices, $\lambda_{\max}(S_W^{-1} S_B)$."
  • Geometric convergence: Emergence of linearly separable mod-$T$ classes in embeddings, enabling linear decoding. "We call the emergence of Fourier spikes spectral convergence and the emergence of linearly separable mod-$T$ classes geometric convergence."
  • Geometric separability: The property that classes can be separated by linear boundaries in embedding space. "we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability."
  • Grokking: A delayed generalization phenomenon where training accuracy precedes test accuracy by a long margin before a phase transition. "unlike the grokking observed in modular arithmetic"
  • Harmonic frequencies: Integer multiples of a base frequency corresponding to a given period in the Fourier domain. "Let $\Phi_T = \sum_{\ell=1}^{T-1} \|\hat{E}_{\ell/T}\|^2$ be the total power at harmonics of period $T$, and $H_T = \{0, 1/T, 2/T, \ldots, (T-1)/T\}$ be the set of harmonic frequencies."
  • Linear probe: A linear classifier trained on frozen embeddings to test whether a property is linearly decodable. "we train a linear probe ($T$-class logistic regression) to predict $n \bmod T$ from $e(n)$."
  • Linear representation hypothesis: The conjecture that high-level concepts, if learned, are linearly decodable from representations. "The linear representation hypothesis \citep{park2023lrh} conjectures that such structure should be accessible via linear probes."
  • Logistic regression: A linear classification model used here as a probe for mod-$T$ decoding. "we train a linear probe ($T$-class logistic regression) to predict $n \bmod T$"
  • Mechanistic interpretability: Research that aims to reverse-engineer learned circuits and representations in models. "Mechanistic interpretability aims to reverse-engineer the representations and algorithms learned by LLMs."
  • Median-normalized magnitude: A normalization of spectral power by the median across frequencies to highlight peaks. "Each row shows the median-normalized magnitude at each Fourier frequency."
  • Mod-$T$: Equivalence classes of integers under modulus-$T$ arithmetic; here, classes to be decoded from embeddings. "the residue class $n \bmod T$ is linearly decodable from the embedding $e(n)$."
  • Modular arithmetic: Arithmetic system where numbers wrap around after reaching a certain modulus. "Linear probes reveal only the Transformer and Gated DeltaNet learned functional modular arithmetic with high Cohen's $\kappa$"
  • Moduli: Plural of modulus; different mod-$T$ tasks considered (e.g., 2, 5, 10). "remains near chance across all moduli."
  • PPMI: Positive Pointwise Mutual Information; a classical word embedding method based on co-occurrence statistics. "PPMI and word2vec fall in between."
  • RFM kernel: A kernel based on random feature mappings used as a non-linear probe alternative. "We use three probe types: linear (logistic regression), MLP, and RFM kernel \citep{radhakrishnan2024rfm}"
  • Rayleigh quotient: Ratio measuring separation along a direction, used in LDA to quantify discriminability. "the single direction $v$ that maximizes the ratio of between-class to within-class variance, given by the Rayleigh quotient $\frac{v^\top S_B v}{v^\top S_W v}$."
  • Residue class: Set of integers congruent modulo T. "define the mod-$T$ residue classes $C_r = \{n : n \equiv r \pmod{T}\}$"
  • Stratified digit-count sampling: Sampling scheme ensuring balanced coverage over operand digit lengths in addition tasks. "In 9-digit addition, operands have 1-9 digits with stratified digit-count sampling."
  • Text-number co-occurrence: Joint appearance of numbers with surrounding text, providing a learning signal beyond frequency. "including text-number co-occurrence and cross-number interaction"
  • Tied embeddings: Using the same embedding matrix for both input and output token representations. "Since all our models use tied embeddings"
  • Token frequency distribution: Empirical counts of tokens; here, number tokens’ marginal distribution exhibiting periodic spectra. "Even the raw token frequency distribution of numbers in the training corpus, with no model at all, exhibits the same periodic spectrum"
  • Tokenizer: The component that maps text to tokens; its design shapes numerical representation learning. "the tokenizer determines whether Fourier structure emerges"
  • Unigram Replace: A perturbation that resamples each number token independently from its marginal (unigram) distribution, destroying co-occurrences. "including Unigram Replace, which destroys all co-occurrence structure by independently resampling every number token from its marginal distribution."
  • Within-class scatter matrix: Matrix capturing variance within each class around its class mean. "between-class scatter matrix S_B, and within-class scatter matrix S_W be"
