- The paper shows that small transformer models can learn to predict the Möbius and squarefree indicator functions, achieving over 70% accuracy for squarefree predictions.
- The methodology employs Chinese Remainder Theorem encodings and iterative model refinement; the resulting models outperform trivial strategies within just 2 epochs.
- The study provides a theoretical explanation linking model performance to number theory and introduces the Möbius Challenge for future research.
This paper explores the capability of small transformer models to predict the Möbius function $\mu(n)$ and the squarefree indicator function $\mu^2(n)$. The author trains models based on the architecture of Charton [charton2024learning] and then iteratively refines these models to understand their functionality, ultimately deriving a theoretical explanation for their behavior.
The paper begins by acknowledging the success of deep learning, particularly transformer-based models, in various fields, including pure mathematics. However, it notes that applying deep learning to concrete numerical calculations has been less successful. The paper investigates whether small transformers can learn functions that are known to be difficult to compute, such as the Möbius function and the squarefree indicator function.
The Möbius function is defined as:
$$\mu(n) = \begin{cases} 1 & \text{if } n = 1, \\ 0 & \text{if } n \text{ has a squared prime factor}, \\ (-1)^k & \text{if } n = p_1 p_2 \cdots p_k \text{ where the } p_i \text{ are distinct primes,} \end{cases}$$
where:
- $n$ is a positive integer
- $p_i$ are distinct prime numbers
- $k$ is the number of distinct prime factors
The squarefree indicator function, $\mu^2(n)$, is $0$ if $n$ is divisible by a nontrivial square and $1$ otherwise (i.e., exactly when $n$ is squarefree).
The paper notes that the obvious algorithm to compute $\mu(n)$ and $\mu^2(n)$ involves factoring $n$, and no known algorithm performs significantly better. It mentions the "Möbius Randomness" conjecture, which suggests that $\mu(n)$ does not correlate strongly with any function computable in polynomial time.
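As a point of reference, a minimal sketch (not the paper's code) of the factoring-based computation might look as follows; trial division stands in for a general factoring routine:

```python
def mobius(n):
    """Compute mu(n) by trial-division factorization: return 0 as soon as a
    squared prime factor appears, otherwise (-1)^k for k distinct prime factors."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = 0
    d = 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:       # d^2 divided the original n
                return 0
            k += 1
        d += 1
    if n > 1:                    # a leftover prime factor remains
        k += 1
    return (-1) ** k


def squarefree_indicator(n):
    """mu^2(n): 1 if n is squarefree, 0 otherwise."""
    return mobius(n) ** 2


# First values: mu(1..10) = 1, -1, -1, 0, -1, 1, -1, 0, 0, 1
assert [mobius(n) for n in range(1, 11)] == [1, -1, -1, 0, -1, 1, -1, 0, 0, 1]
```

Trial division takes on the order of $\sqrt{n}$ steps, i.e., time exponential in the number of digits of $n$, which is why direct computation does not scale.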
The author uses a Chinese Remainder Theorem (CRT) representation of integers to obfuscate square divisibility patterns. Integers $n$ are sampled uniformly at random between $2$ and $10^{13}$ and represented as the sequence of residues $n \bmod p_j$ for the first $100$ primes $p_j$. The author encodes each residue in the CRT representation as the pair $(n \bmod p_j, p_j)$, where the two integers are represented as sequences of digits in base $1000$ using the sign token `+` as a separator. The transformers are trained to output tokens representing either $\mu(n)$ or $\mu^2(n)$ by minimizing cross-entropy.
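A minimal sketch of one plausible reading of this encoding (the exact token conventions of the paper's Int2Int setup may differ, and the helper names here are illustrative):

```python
def first_primes(k):
    """Return the first k primes by simple trial division."""
    primes, c = [], 2
    while len(primes) < k:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


def base_1000_tokens(x):
    """Digits of x in base 1000 (most significant first), prefixed by the sign token '+'."""
    digits = []
    while True:
        digits.append(x % 1000)
        x //= 1000
        if x == 0:
            break
    return ["+"] + [str(d) for d in reversed(digits)]


def crt_tokens(n, num_primes=100):
    """Token sequence of the pairs (n mod p_j, p_j) over the first num_primes primes."""
    tokens = []
    for p in first_primes(num_primes):
        tokens += base_1000_tokens(n % p) + base_1000_tokens(p)
    return tokens


# For illustration, the start of the encoding of n = 123456789:
print(crt_tokens(123456789, num_primes=3))
# ['+', '1', '+', '2', '+', '0', '+', '3', '+', '4', '+', '5']
```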
The results of the CRT experiments show that the models quickly outperform trivial strategies. Within $2$ epochs, models trained on CRT representations of $(n, \mu^2(n))$ correctly predict 70.64% of examples, compared to 60.79% for the default strategy (the density of squarefree integers, $6/\pi^2 \approx 60.79\%$, which is what always guessing "squarefree" achieves). Models trained on CRT representations of $(n, \mu(n))$ reach 50.98% accuracy within the first $2$ epochs. Further experiments restricted to squarefree integers indicate that the models' predictive power comes from detecting when $n$ is divisible by a square.
The paper explores several false starts in explaining the models' success, including the possibility that the models are learning some aspect of the Chinese Remainder Theorem (CRT). However, experiments using variations of the CRT representation of $n$ and transformers of similar architecture fail to support this hypothesis. For example, models fail to predict $n$ given $n \bmod p_j$ for the first $10$ primes $p_j$, or to learn the indicator function of an interval given a CRT representation.
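For context, the computation such a model would have to approximate is classical: the constructive Chinese Remainder Theorem recovers $n$ modulo the product of the moduli. A minimal sketch (illustrative, not from the paper):

```python
from math import prod


def crt_reconstruct(residues, moduli):
    """Recover x mod prod(moduli) from the residues x mod m for pairwise
    coprime moduli, via the standard constructive CRT formula."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m): inverse of Mi modulo m
    return x % M


primes_10 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
n = 123456789
residues = [n % p for p in primes_10]
assert crt_reconstruct(residues, primes_10) == n % prod(primes_10)
```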
Limited-input experiments reveal that using only $(n \bmod 2, n \bmod 3)$ to predict $\mu^2(n)$ gives a model with 70.1% accuracy. This suggests that knowing $n \bmod 6$ is enough to guess $\mu^2(n)$ with about 70% probability.
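To see why roughly 70% is the natural ceiling for this restricted input, one can tabulate the best classifier that sees only $n \bmod 6$ over a sample range (a sketch assuming accuracy is measured against the true $\mu^2(n)$ on roughly uniform inputs; the range below is much smaller than the paper's $10^{13}$):

```python
import math
from collections import Counter

N = 10**6  # small range for illustration

# Sieve the squarefree indicator mu^2(n) for 2 <= n <= N by striking out
# multiples of p^2 for every prime p <= sqrt(N).
squarefree = [True] * (N + 1)
for p in range(2, math.isqrt(N) + 1):
    if all(p % q for q in range(2, math.isqrt(p) + 1)):  # p is prime
        for m in range(p * p, N + 1, p * p):
            squarefree[m] = False

# For each residue class mod 6, tally squarefree vs. not, then score the
# classifier that predicts the majority label of each class.
counts = {r: Counter() for r in range(6)}
for n in range(2, N + 1):
    counts[n % 6][squarefree[n]] += 1
best = sum(max(c.values()) for c in counts.values())
print(f"best accuracy using only n mod 6: {best / (N - 1):.4f}")  # ~0.70
```

The tabulation shows the mechanism: residues coprime to $6$ are very likely squarefree, residues divisible by $2$ (and especially by $6$) are best guessed as not squarefree, and weighting the classes gives roughly 70%.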
The paper provides a theoretical explanation for the models' behavior. The model learns that numbers that are not squarefree are probably divisible by a small square. The paper constructs Dirichlet series for the relevant subsets of natural numbers and computes their residues in terms of the Riemann zeta function $\zeta(s)$. Using only divisibility by the first $25$ primes leads to a strategy that approximates $\mu^2(n)$ with accuracy 70.34%.
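The key quantitative point can be checked with a back-of-the-envelope computation (not the paper's Dirichlet-series argument itself): the density of non-squarefree integers is $1 - 1/\zeta(2) = 1 - 6/\pi^2$, and almost all of it is already accounted for by divisibility by $p^2$ for one of the first $25$ primes:

```python
import math


def first_primes(k):
    """First k primes by trial division (k = 25 is tiny)."""
    primes, c = [], 2
    while len(primes) < k:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


# Density of non-squarefree integers: 1 - 1/zeta(2) = 1 - 6/pi^2.
density_non_squarefree = 1 - 6 / math.pi**2

# Density of integers divisible by p^2 for at least one of the first 25 primes:
# 1 - prod_p (1 - 1/p^2), since the conditions "p^2 | n" are independent across primes.
prod = 1.0
for p in first_primes(25):
    prod *= 1 - 1 / p**2
density_small_square = 1 - prod

print(f"non-squarefree density:            {density_non_squarefree:.4f}")  # ~0.3921
print(f"divisible by a small prime square: {density_small_square:.4f}")    # ~0.3910
```

So detecting divisibility by the square of one of the first $25$ primes already captures nearly all non-squarefree integers, which is the sense in which "not squarefree" and "divisible by a small square" almost coincide; the paper's Dirichlet-series computation makes this, and the 70.34% figure, precise.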
The paper concludes that previous neural-network experiments trained to predict $\mu^2(n)$ from $n$ appear to learn the is-divisible-by-a-small-prime-square function, whereas small transformer models trained on CRT representations of $n$ to predict $\mu^2(n)$ appear to learn the is-divisible-by-a-small-prime function. The author notes that small transformers like Int2Int are rapidly trainable and useful as one-sided oracles for determining whether inputs contain enough information to evaluate an output. The paper ends with a "Möbius Challenge": train an ML model with inputs computable in time $\ll \log^A(n)$ for some finite $A$ that distinguishes between $\mu(n) = 1$ and $\mu(n) = -1$ with probability greater than 51%.