Learning the greatest common divisor: explaining transformer predictions

Published 29 Aug 2023 in cs.LG and cs.AI | (2308.15594v2)

Abstract: The predictions of small transformers, trained to calculate the greatest common divisor (GCD) of two positive integers, can be fully characterized by looking at model inputs and outputs. As training proceeds, the model learns a list $\mathcal D$ of integers, products of divisors of the base used to represent integers and small primes, and predicts the largest element of $\mathcal D$ that divides both inputs. Training distributions impact performance. Models trained from uniform operands only learn a handful of GCD (up to $38$ GCD $\leq100$). Log-uniform operands boost performance to $73$ GCD $\leq 100$, and a log-uniform distribution of outcomes (i.e. GCD) to $91$. However, training from uniform (balanced) GCD breaks explainability.

Authors (1)

François Charton

Citations (16)

Summary

The paper demonstrates that small transformers learn the GCD task with deterministic predictions, enabling clear categorization of operand pairs based on shared divisibility.
The paper shows that training data distribution and encoding base choice critically influence the diversity and accuracy of the predicted GCD values.
The paper observes a grokking phenomenon in large composite bases, revealing that extended training can uncover more intricate arithmetic relationships.

An Analysis of Transformer Predictions for the Greatest Common Divisor Task

The paper investigates how well small transformers perform the arithmetic task of computing the greatest common divisor (GCD) of two positive integers. It shows how transformers learn and predict this function, and how the distribution of training inputs and the base used to represent integers affect both performance and explainability.
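To make "representation base" concrete, here is a minimal sketch of writing an integer as a sequence of digits in a chosen base. The helper names `to_base` and `divisible_from_last_digit` are illustrative assumptions, not taken from the paper, and the exact tokenization used there may differ. The second helper shows a standard arithmetic fact: when d divides the base, divisibility by d can be read off the last digit alone.

```python
def to_base(n: int, base: int) -> list[int]:
    """Digits of a positive integer n in the given base, most significant first."""
    if n <= 0 or base < 2:
        raise ValueError("n must be positive and base must be at least 2")
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def divisible_from_last_digit(n: int, d: int, base: int) -> bool:
    """Check n % d == 0 using only the last digit; valid only when d divides base."""
    if base % d != 0:
        raise ValueError("this shortcut only works when d divides the base")
    return to_base(n, base)[-1] % d == 0


print(to_base(100, 10))                       # digits of 100 in base 10
print(to_base(100, 30))                       # digits of 100 in base 30
print(divisible_from_last_digit(100, 5, 30))  # 5 divides 30, so the last digit decides
```

In base 30, for instance, the last digit settles divisibility by 2, 3, 5, 6, 10, 15 and 30, whereas in base 10 it only settles 2, 5 and 10.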

Key Findings

Deterministic Predictions of GCDs: The study found that transformers group input pairs by their GCD and produce one consistent prediction for each group. At any point in training, the model has learned a list $\mathcal D$ of integers (products of divisors of the encoding base and of small primes) and predicts the largest element of $\mathcal D$ that divides both inputs. Because this rule holds throughout training, model behavior can be explained from inputs and outputs alone.
Role of Input and Output Distributions: The model's ability to learn GCDs is strongly influenced by the distribution of the training data. With uniformly sampled operands, the model learns only a limited number of GCD values (up to 38 of the GCDs up to 100). Sampling operands from a log-uniform distribution raises this to 73, and a log-uniform distribution of outcomes (i.e. of GCDs) raises it to 91. Training on a uniform (balanced) distribution of GCDs, however, breaks explainability, suggesting that input and output distributions must be chosen carefully to keep predictions both accurate and transparent.
Grokking Phenomenon with Large Bases: With large composite encoding bases, a phenomenon akin to "grokking" was observed: GCDs that are not divisors of the base were learned only after prolonged training. This suggests that transformers can eventually pick up more intricate arithmetic relationships than those acquired early on, extending the set of GCDs they predict correctly over time.
Base Size and Composition: The choice of base for integer representation is critical. Composite number bases, especially those abundant in small prime factors, enable the model to predict more GCDs accurately. This is because such bases facilitate simpler divisibility rules that the transformers can learn and apply effectively.
Training Dynamics and Explainability: The paper elucidates a systematic approach to characterizing and explaining transformer predictions by experimenting with various training settings. This methodology could inspire similar approaches for other arithmetic tasks, emphasizing the importance of understanding model behavior through experimental manipulation rather than purely analytical means.
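The prediction rule stated in the abstract (predict the largest element of the learned list that divides both inputs) can be sketched in a few lines. This is a rule-based imitation of the described behavior, not the transformer itself. The function name `predict_gcd` and the example list `LEARNED_BASE10` are hypothetical, chosen for illustration only, and are not values reported in the paper.

```python
from math import gcd


def predict_gcd(a: int, b: int, learned: set[int]) -> int:
    """Return the largest learned integer that divides both a and b.

    1 is always treated as learned, so a prediction always exists.
    """
    candidates = set(learned) | {1}
    return max(d for d in candidates if a % d == 0 and b % d == 0)


# Hypothetical learned list for base 10, chosen for illustration only.
LEARNED_BASE10 = {1, 2, 4, 5, 10, 20, 25, 50, 100}

for a, b in [(200, 300), (60, 90), (21, 14)]:
    print(a, b, "true:", gcd(a, b), "predicted:", predict_gcd(a, b, LEARNED_BASE10))
```

Under this rule the prediction is correct exactly when the true GCD is itself in the list. With the list above, the pair (60, 90) has true GCD 30 but is predicted as 10, the largest listed integer dividing both operands.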

Implications and Future Directions

The findings from this research have multiple implications. Primarily, they demonstrate a path toward enhancing the interpretability of machine learning models by leveraging mathematical properties inherent in the data representation and distribution. The observed dependency on training distributions also provides a strategic lever to enhance model learning and robustness, which could be beneficial for other arithmetic tasks or even broader applications in machine learning such as fine-tuning LLMs.
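As a rough illustration of the training-distribution lever, the sketch below draws operands either uniformly or log-uniformly. It assumes that "log-uniform" means the probability of an integer k is roughly proportional to 1/k. The function names, the seed, and the operand range are assumptions made for this example and are not taken from the paper.

```python
import math
import random
from math import gcd


def sample_uniform(rng: random.Random, max_value: int) -> int:
    """Integer drawn uniformly from 1..max_value."""
    return rng.randint(1, max_value)


def sample_log_uniform(rng: random.Random, max_value: int) -> int:
    """Integer in 1..max_value with probability approximately proportional to 1/k."""
    # Draw uniformly in log space, then map back and truncate to an integer.
    x = math.exp(rng.uniform(0.0, math.log(max_value + 1)))
    return min(max_value, int(x))


def make_example(rng: random.Random, max_value: int, sampler) -> tuple[int, int, int]:
    """One training triple (a, b, gcd(a, b)) with operands drawn by `sampler`."""
    a = sampler(rng, max_value)
    b = sampler(rng, max_value)
    return a, b, gcd(a, b)


rng = random.Random(0)
uniform_data = [make_example(rng, 10**6, sample_uniform) for _ in range(10_000)]
log_uniform_data = [make_example(rng, 10**6, sample_log_uniform) for _ in range(10_000)]

# Small operands are rare under uniform sampling and common under log-uniform sampling.
print(sum(a <= 10 for a, _, _ in uniform_data))
print(sum(a <= 10 for a, _, _ in log_uniform_data))
```

Log-uniform sampling shows the model many more small operands, which is one plausible reading of why it helps; the summary above reports the effect but this sketch does not reproduce any training result.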

Future research might explore further tuning of training distributions, perhaps applying curriculum learning strategies without succumbing to catastrophic forgetting. Additionally, the phenomenon of grokking could be further dissected to understand its triggers and applications, potentially leveraging it for more complex arithmetic or algorithmic learning tasks.

While the study refrains from exploring Euclid's algorithm directly, its framework opens opportunities to contrast machine-learned arithmetic behaviors with traditional algorithmic approaches, offering insights into bridging algorithmic efficiency with data-driven learning models in AI.
