
Studying number theory with deep learning: a case study with the Möbius and squarefree indicator functions (2502.10335v1)

Published 14 Feb 2025 in math.NT and cs.LG

Abstract: Building on work of Charton, we train small transformer models to calculate the Möbius function $\mu(n)$ and the squarefree indicator function $\mu^2(n)$. The models attain nontrivial predictive power. We then iteratively train additional models to understand how the model functions, ultimately finding a theoretical explanation.

Summary

  • The paper shows that small transformer models can learn to predict the Möbius and squarefree indicator functions, achieving over 70% accuracy for squarefree predictions.
  • The methodology employs Chinese Remainder Theorem encodings and rapid model refinement, outperforming trivial strategies within just 2 epochs.
  • The study provides a theoretical explanation linking model performance to number theory and introduces the Möbius Challenge for future research.

This paper explores the capability of small transformer models to predict the Möbius function $\mu(n)$ and the squarefree indicator function $\mu^2(n)$. The author trains models based on the architecture of Charton [charton2024learning] and then iteratively refines these models to understand their functionality, ultimately deriving a theoretical explanation for their behavior.

The paper begins by acknowledging the success of deep learning, particularly transformer-based models, in various fields, including pure mathematics. However, it notes that applying deep learning to concrete numerical calculations has been less successful. The paper investigates whether small transformers can learn functions that are known to be difficult to compute, such as the Möbius function and the squarefree indicator function.

The Möbius function is defined as:

$$\mu(n) = \begin{cases} 1 & \text{if } n = 1, \\ 0 & \text{if } n \text{ has a squared prime factor}, \\ (-1)^k & \text{if } n = p_1 p_2 \cdots p_k \text{ where the } p_i \text{ are distinct primes}. \end{cases}$$

where:

  • $n$ is a positive integer
  • $p_i$ are distinct prime numbers
  • $k$ is the number of distinct prime factors

The squarefree indicator function $\mu^2(n)$ is $0$ if $n$ is divisible by a nontrivial square (i.e., $n$ is not squarefree) and $1$ otherwise.
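For concreteness, here is a minimal sketch (in Python, with helper names of our choosing rather than the paper's) of the straightforward trial-division approach to both functions:

```python
def mobius(n: int) -> int:
    """Möbius function via trial division: 0 if a squared prime divides n,
    otherwise (-1)^k with k the number of distinct prime factors."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k, d = 0, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:        # d^2 divides the original n
                return 0
            k += 1
        d += 1
    if n > 1:                     # leftover prime factor larger than sqrt(n)
        k += 1
    return -1 if k % 2 else 1

def squarefree_indicator(n: int) -> int:
    """mu^2(n): 1 if n is squarefree, 0 otherwise."""
    return mobius(n) ** 2

print(mobius(6), mobius(12), mobius(30))                    # 1 0 -1
print(squarefree_indicator(18), squarefree_indicator(35))   # 0 1
```

Trial division takes on the order of $\sqrt{n}$ steps, which is why the paper treats these functions as expensive to compute directly.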

The paper notes that the obvious algorithm to compute $\mu(n)$ and $\mu^2(n)$ involves factoring $n$, and no known algorithm performs significantly better. It mentions the "Möbius Randomness" conjecture, which suggests that $\mu(n)$ does not correlate strongly with any function computable in polynomial time.

The author uses a Chinese Remainder Theorem (CRT) representation of integers to obfuscate square divisibility patterns. Integers $n$ are sampled uniformly at random between $2$ and $10^{13}$ and represented as the sequence of residues $n \bmod p_j$ for the first $100$ primes $p_j$. Each residue is encoded as the pair $(n \bmod p_j, p_j)$, where the two integers are written as sequences of digits in base $1000$ with the sign '+' as a separator. The transformers are trained to output tokens representing either $\mu(n)$ or $\mu^2(n)$ by minimizing cross-entropy.
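A rough sketch of how such an input encoding could look in Python (the function names and exact token conventions here are illustrative assumptions, not the paper's Int2Int code):

```python
from sympy import prime  # prime(k) returns the k-th prime

def to_base_1000(m: int) -> list[str]:
    """Digits of m in base 1000, most significant first, preceded by a '+' sign token."""
    digits = []
    while True:
        digits.append(str(m % 1000))
        m //= 1000
        if m == 0:
            break
    return ["+"] + digits[::-1]

def crt_tokens(n: int, num_primes: int = 100) -> list[str]:
    """Token sequence for the CRT representation of n: for each of the first
    `num_primes` primes p, the pair (n mod p, p), each written in base 1000
    with '+' as the sign/separator token.  The paper's actual tokenization
    may differ in details; this is only an illustration."""
    tokens = []
    for k in range(1, num_primes + 1):
        p = prime(k)
        tokens += to_base_1000(n % p) + to_base_1000(p)
    return tokens

# First few tokens for n = 123456789, using only 3 primes for brevity
print(crt_tokens(123456789, num_primes=3))
# ['+', '1', '+', '2', '+', '0', '+', '3', '+', '4', '+', '5']
```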

The results of the CRT experiments show that the models quickly outperform trivial strategies. Within $2$ epochs, models trained on CRT representations of $(n, \mu^2(n))$ correctly predict $70.64\%$ of examples, compared to $60.79\%$ for the default strategy. Models trained on CRT representations of $(n, \mu(n))$ achieve $50.98\%$ accuracy within the first $2$ epochs. Further experiments restricted to squarefree integers reveal that the predictive power of the models lies in their ability to predict when $n$ is divisible by a square.
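The $60.79\%$ baseline presumably corresponds to always guessing "squarefree," since the asymptotic density of squarefree integers is $6/\pi^2 \approx 0.6079$. A quick check (our own, not from the paper):

```python
import math

# Asymptotic density of squarefree integers: 6 / pi^2
print(6 / math.pi**2)                     # 0.60792...

# Empirical check: mark every multiple of a square d^2 >= 4 as non-squarefree
N = 10**6
squarefree = [True] * (N + 1)
d = 2
while d * d <= N:
    for m in range(d * d, N + 1, d * d):
        squarefree[m] = False
    d += 1
print(sum(squarefree[1:]) / N)            # ~0.6079
```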

The paper explores several false starts in explaining the models' success, including the possibility that the models are learning some aspect of the Chinese Remainder Theorem itself. However, experiments using variations of the CRT representation of $n$ and transformers of similar architecture fail to support this hypothesis. For example, models fail to predict $n$ given $n \bmod p_j$ for the first $10$ primes $p_j$, or to learn the indicator function of an interval given a CRT representation.

Limited-input experiments reveal that using only $(n \bmod 2, n \bmod 3)$ to predict $\mu^2(n)$ gives a model with $70.1\%$ accuracy. This suggests that knowing $n \bmod 6$ is enough to guess $\mu^2(n)$ with about $70\%$ probability.
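A back-of-the-envelope calculation (ours, though consistent with the paper's heuristic) reproduces this figure: condition on whether $2$ and $3$ divide $n$, use the fact that $p^2 \mid n$ has conditional probability $1/p$ given $p \mid n$, fold in the density $\prod_{p \ge 5}(1 - 1/p^2) = 9/\pi^2$ contributed by the remaining primes, and always guess the likelier class:

```python
import math

# Density of integers free of squares of primes >= 5:
# (6/pi^2) / ((1 - 1/4) * (1 - 1/9)) = 9/pi^2
tail = 9 / math.pi**2

accuracy = 0.0
for div2 in (False, True):                 # does 2 divide n?  P = 1/2 either way
    for div3 in (False, True):             # does 3 divide n?  P(3 | n) = 1/3
        weight = 0.5 * ((1/3) if div3 else (2/3))
        p_sf = tail                        # P(n squarefree | this divisibility pattern)
        if div2:
            p_sf *= 1 - 1/2                # also need 4 not to divide n, given 2 | n
        if div3:
            p_sf *= 1 - 1/3                # also need 9 not to divide n, given 3 | n
        accuracy += weight * max(p_sf, 1 - p_sf)   # guess the likelier class

print(round(accuracy, 4))                  # ~0.7026, in line with the reported ~70%
```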

The paper provides a theoretical explanation for the models' behavior. The model learns that numbers that are not squarefree are probably divisible by a small square. The paper constructs Dirichlet series for relevant subsets of natural numbers and computes their residues in terms of the Riemann zeta function $\zeta(s)$. Using only divisibility by the first $25$ primes leads to a strategy that approximates $\mu^2(n)$ with accuracy $70.34\%$.
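Rather than redoing the Dirichlet-series computation, a quick Monte Carlo sanity check of the same densities lands in the right neighborhood (this is our reconstruction of the 25-prime strategy, not the paper's calculation):

```python
import math, random

PRIMES_25 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
             47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]

# Density of integers with no squared prime factor beyond the first 25 primes:
# (6/pi^2) divided by prod_{p <= 97} (1 - 1/p^2)
tail = (6 / math.pi**2) / math.prod(1 - 1/p**2 for p in PRIMES_25)

def expected_accuracy(samples: int = 10**6, seed: int = 0) -> float:
    """Monte Carlo estimate of the best possible accuracy for guessing mu^2(n)
    from divisibility of n by the first 25 primes only (a simplified
    reconstruction of the strategy analysed in the paper)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        p_sf = tail
        for p in PRIMES_25:
            if rng.random() < 1 / p:       # p divides n with probability 1/p
                p_sf *= 1 - 1 / p          # then p^2 avoids n with probability 1 - 1/p
        total += max(p_sf, 1 - p_sf)       # guess the likelier class
    return total / samples

print(round(expected_accuracy(), 4))        # ~0.703, close to the reported 70.34%
```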

The paper concludes that previous neural network experiments trained to predict $\mu^2(n)$ from $n$ appear to learn the is-divisible-by-a-small-prime-square function, whereas small transformer models trained on CRT representations of $n$ to predict $\mu^2(n)$ appear to learn the is-divisible-by-a-small-prime function. The author notes that small transformers like Int2Int are rapidly trainable and useful as one-sided oracles to determine whether inputs contain enough information to evaluate an output. The paper ends with a "Möbius Challenge": to train an ML model with inputs computable in time $\ll \log^A(n)$ for some finite $A$ that distinguishes between $\mu(n) = 1$ and $\mu(n) = -1$ with probability greater than $51\%$.
