- The paper shows that small transformer models can learn to predict the Möbius and squarefree indicator functions, achieving over 70% accuracy for squarefree predictions.
- The methodology employs Chinese Remainder Theorem encodings and iterative model refinement; the resulting models outperform trivial strategies within just 2 epochs.
- The study provides a theoretical explanation linking model performance to number theory and introduces the Möbius Challenge for future research.
This paper explores the capability of small transformer models to predict the Möbius function $\mu(n)$ and the squarefree indicator function $\mu^2(n)$. The author trains models based on the architecture of Charton [charton2024learning] and then iteratively refines these models to understand their functionality, ultimately deriving a theoretical explanation for their behavior.
The paper begins by acknowledging the success of deep learning, particularly transformer-based models, in various fields, including pure mathematics. However, it notes that applying deep learning to concrete numerical calculations has been less successful. The paper investigates whether small transformers can learn functions that are known to be difficult to compute, such as the Möbius function and the squarefree indicator function.
The Möbius function is defined as:
$$\mu(n) = \begin{cases} 1 & \text{if } n = 1, \\ 0 & \text{if } n \text{ has a squared prime factor}, \\ (-1)^k & \text{if } n = p_1 p_2 \cdots p_k \text{ where the } p_i \text{ are distinct primes,} \end{cases}$$
where:
- $n$ is a positive integer
- $p_i$ are distinct prime numbers
- $k$ is the number of distinct prime factors
The squarefree indicator function, $\mu^2(n)$, is $0$ if $n$ is divisible by a nontrivial square and $1$ otherwise (i.e., exactly when $n$ is squarefree).
The paper notes that the obvious algorithm to compute $\mu(n)$ and $\mu^2(n)$ involves factoring $n$, and no known algorithm performs significantly better. It mentions the "Möbius Randomness" conjecture, which suggests that $\mu(n)$ does not correlate strongly with any function computable in polynomial time.
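As a point of reference, a minimal sketch (not the paper's code) of the factoring-based computation might look as follows; trial division stands in for a general factoring routine:

```python
def mobius(n):
    """Compute mu(n) by trial-division factorization: return 0 as soon as a
    squared prime factor appears, otherwise (-1)^k for k distinct prime factors."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = 0
    d = 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:       # d^2 divided the original n
                return 0
            k += 1
        d += 1
    if n > 1:                    # a leftover prime factor remains
        k += 1
    return (-1) ** k


def squarefree_indicator(n):
    """mu^2(n): 1 if n is squarefree, 0 otherwise."""
    return mobius(n) ** 2


# First values: mu(1..10) = 1, -1, -1, 0, -1, 1, -1, 0, 0, 1
assert [mobius(n) for n in range(1, 11)] == [1, -1, -1, 0, -1, 1, -1, 0, 0, 1]
```

Trial division takes on the order of $\sqrt{n}$ steps, i.e., time exponential in the number of digits of $n$, which is why direct computation does not scale.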
The author uses a Chinese Remainder Theorem (CRT) representation of integers to obfuscate square divisibility patterns. Integers $n$ are sampled uniformly at random between $2$ and $10^{13}$ and represented as the sequence of residues $n \bmod p_j$ for the first $100$ primes $p_j$. The author encodes each residue in the CRT representation as the pair $(n \bmod p_j, p_j)$, where the two integers are represented as sequences of digits in base $1000$ using the sign token `+` as a separator. The transformers are trained to output tokens representing either $\mu(n)$ or $\mu^2(n)$ by minimizing cross-entropy.
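A minimal sketch of one plausible reading of this encoding (the exact token conventions of the paper's Int2Int setup may differ, and the helper names here are illustrative):

```python
def first_primes(k):
    """Return the first k primes by simple trial division."""
    primes, c = [], 2
    while len(primes) < k:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


def base_1000_tokens(x):
    """Digits of x in base 1000 (most significant first), prefixed by the sign token '+'."""
    digits = []
    while True:
        digits.append(x % 1000)
        x //= 1000
        if x == 0:
            break
    return ["+"] + [str(d) for d in reversed(digits)]


def crt_tokens(n, num_primes=100):
    """Token sequence of the pairs (n mod p_j, p_j) over the first num_primes primes."""
    tokens = []
    for p in first_primes(num_primes):
        tokens += base_1000_tokens(n % p) + base_1000_tokens(p)
    return tokens


# For illustration, the start of the encoding of n = 123456789:
print(crt_tokens(123456789, num_primes=3))
# ['+', '1', '+', '2', '+', '0', '+', '3', '+', '4', '+', '5']
```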
The results of the CRT experiments show that the models quickly outperform trivial strategies. Within $2$ epochs, models trained on CRT representations of $(n, \mu^2(n))$ correctly predict 70.64% of examples, compared to 60.79% for the default strategy (the density of squarefree integers, $6/\pi^2 \approx 60.79\%$, which is what always guessing "squarefree" achieves). Models trained on CRT representations of $(n, \mu(n))$ reach 50.98% accuracy within the first $2$ epochs. Further experiments restricted to squarefree integers indicate that the models' predictive power comes from detecting when $n$ is divisible by a square.
The paper explores several false starts in explaining the models' success, including the possibility that the models are learning some aspect of the Chinese Remainder Theorem (CRT). However, experiments using variations of the CRT representation of $n$ and transformers of similar architecture fail to support this hypothesis. For example, models fail to predict $n$ given $n \bmod p_j$ for the first $10$ primes $p_j$, or to learn the indicator function of an interval given a CRT representation.
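For context, the computation such a model would have to approximate is classical: the constructive Chinese Remainder Theorem recovers $n$ modulo the product of the moduli. A minimal sketch (illustrative, not from the paper):

```python
from math import prod


def crt_reconstruct(residues, moduli):
    """Recover x mod prod(moduli) from the residues x mod m for pairwise
    coprime moduli, via the standard constructive CRT formula."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m): inverse of Mi modulo m
    return x % M


primes_10 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
n = 123456789
residues = [n % p for p in primes_10]
assert crt_reconstruct(residues, primes_10) == n % prod(primes_10)
```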
Limited-input experiments reveal that using only $(n \bmod 2, n \bmod 3)$ to predict $\mu^2(n)$ gives a model with 70.1% accuracy. This suggests that knowing $n \bmod 6$ is enough to guess $\mu^2(n)$ with about 70% probability.
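To see why roughly 70% is the natural ceiling for this restricted input, one can tabulate the best classifier that sees only $n \bmod 6$ over a sample range (a sketch assuming accuracy is measured against the true $\mu^2(n)$ on roughly uniform inputs; the range below is much smaller than the paper's $10^{13}$):

```python
import math
from collections import Counter

N = 10**6  # small range for illustration

# Sieve the squarefree indicator mu^2(n) for 2 <= n <= N by striking out
# multiples of p^2 for every prime p <= sqrt(N).
squarefree = [True] * (N + 1)
for p in range(2, math.isqrt(N) + 1):
    if all(p % q for q in range(2, math.isqrt(p) + 1)):  # p is prime
        for m in range(p * p, N + 1, p * p):
            squarefree[m] = False

# For each residue class mod 6, tally squarefree vs. not, then score the
# classifier that predicts the majority label of each class.
counts = {r: Counter() for r in range(6)}
for n in range(2, N + 1):
    counts[n % 6][squarefree[n]] += 1
best = sum(max(c.values()) for c in counts.values())
print(f"best accuracy using only n mod 6: {best / (N - 1):.4f}")  # ~0.70
```

The tabulation shows the mechanism: residues coprime to $6$ are very likely squarefree, residues divisible by $2$ (and especially by $6$) are best guessed as not squarefree, and weighting the classes gives roughly 70%.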
The paper provides a theoretical explanation for the models' behavior. The model learns that numbers that are not squarefree are probably divisible by a small square. The paper constructs Dirichlet series for the relevant subsets of natural numbers and computes their residues in terms of the Riemann zeta function $\zeta(s)$. Using only divisibility by the first $25$ primes leads to a strategy that approximates $\mu^2(n)$ with accuracy 70.34%.
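The key quantitative point can be checked with a back-of-the-envelope computation (not the paper's Dirichlet-series argument itself): the density of non-squarefree integers is $1 - 1/\zeta(2) = 1 - 6/\pi^2$, and almost all of it is already accounted for by divisibility by $p^2$ for one of the first $25$ primes:

```python
import math


def first_primes(k):
    """First k primes by trial division (k = 25 is tiny)."""
    primes, c = [], 2
    while len(primes) < k:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


# Density of non-squarefree integers: 1 - 1/zeta(2) = 1 - 6/pi^2.
density_non_squarefree = 1 - 6 / math.pi**2

# Density of integers divisible by p^2 for at least one of the first 25 primes:
# 1 - prod_p (1 - 1/p^2), since the conditions "p^2 | n" are independent across primes.
prod = 1.0
for p in first_primes(25):
    prod *= 1 - 1 / p**2
density_small_square = 1 - prod

print(f"non-squarefree density:            {density_non_squarefree:.4f}")  # ~0.3921
print(f"divisible by a small prime square: {density_small_square:.4f}")    # ~0.3910
```

So detecting divisibility by the square of one of the first $25$ primes already captures nearly all non-squarefree integers, which is the sense in which "not squarefree" and "divisible by a small square" almost coincide; the paper's Dirichlet-series computation makes this, and the 70.34% figure, precise.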
The paper concludes that previous neural-network experiments trained to predict $\mu^2(n)$ from $n$ appear to learn the is-divisible-by-a-small-prime-square function, whereas small transformer models trained on CRT representations of $n$ to predict $\mu^2(n)$ appear to learn the is-divisible-by-a-small-prime function. The author notes that small transformers like Int2Int are rapidly trainable and useful as one-sided oracles for determining whether inputs contain enough information to evaluate an output. The paper ends with a "Möbius Challenge": train an ML model with inputs computable in time $\ll \log^A(n)$ for some finite $A$ that distinguishes between $\mu(n) = 1$ and $\mu(n) = -1$ with probability greater than 51%.