AutoML-Zero: Evolving Algorithms from Primitives
- AutoML-Zero is a framework that automates the discovery of ML algorithms from basic math primitives using evolutionary strategies, offering unbiased algorithm design.
- It employs regularized evolution to iteratively modify low-level programs and effectively rediscover established methods like backpropagation and dropout-like noise.
- The approach showcases minimal human bias and emergent algorithmic motifs while facing challenges like search sparsity and reduced interpretability of the evolved code.
AutoML-Zero is a framework for the automated discovery of complete machine learning algorithms from a search space consisting solely of basic mathematical operations. Eschewing the conventional approach of relying on human-specified neural architectures or expert-designed layers, AutoML-Zero leverages a highly generic and unbiased search space in conjunction with evolutionary strategies to produce learning algorithms directly from first principles. It demonstrates that, with the proper evolutionary infrastructure, it is possible to rediscover standard techniques such as backpropagation and develop advances such as bilinear interactions, gradient normalization, weight averaging, and dropout-like behaviors, purely through optimization over primitive mathematical constructs and stochastic search (Real et al., 2020).
1. Generic Search Space and Virtual Machine Specification
AutoML-Zero models machine learning algorithms as three interacting, low-level programs: Setup, Predict, and Learn. These interact through a virtual machine comprising three memory types:
- Scalars (addresses s0, s1, ...)
- Vectors (v0, v1, ...; each of length F, the feature dimension)
- Matrices (m0, m1, ...; each of shape F × F)
All memory is floating-point and zero-initialized at the start of training. Algorithms are evolved as sequences of instructions (genotypes), each instruction being a tuple of (opcode, argument addresses, optional constants), implementing one of approximately 65 operations. The atomic instruction set is strictly confined to "high-school math" primitives, entirely excluding explicit ML constructs. It covers:
- Scalar arithmetic and nonlinearities (e.g., sin, arctan)
- Vector operations (e.g., elementwise addition, ReLU, dot product)
- Matrix-vector and matrix-matrix ops, including multiplication, elementwise operations
- Reductions (e.g., mean, norm)
- Random initializers (uniform and Gaussian draws)
Execution comprises an initial Setup, followed by a training loop interleaving Predict and Learn on training samples (single-pass stochastic training), and a validation loop calling only Predict. Outputs are normalized (sigmoid/softmax) and scored by cross-entropy or MSE, as appropriate.
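The virtual machine described above can be sketched in a few lines. The opcode names, memory sizes, and the toy Predict program below are illustrative assumptions, not the paper's exact instruction set (which spans roughly 65 operations):

```python
import numpy as np

F = 4  # feature dimension (illustrative)

def make_memory(n_scalars=4, n_vectors=4, n_matrices=2):
    """Zero-initialized typed memory, mirroring the paper's scalar/vector/matrix address spaces."""
    return {
        "s": np.zeros(n_scalars),
        "v": np.zeros((n_vectors, F)),
        "m": np.zeros((n_matrices, F, F)),
    }

# Each instruction is a tuple: (opcode, output address, input addresses).
# A tiny subset of ops, with hypothetical opcode names:
OPS = {
    "s_add":   lambda mem, out, a, b: mem["s"].__setitem__(out, mem["s"][a] + mem["s"][b]),
    "v_dot":   lambda mem, out, a, b: mem["s"].__setitem__(out, mem["v"][a] @ mem["v"][b]),
    "mv_prod": lambda mem, out, a, b: mem["v"].__setitem__(out, mem["m"][a] @ mem["v"][b]),
    "v_relu":  lambda mem, out, a, b: mem["v"].__setitem__(out, np.maximum(mem["v"][a], 0)),
}

def execute(program, mem):
    """Run a straight-line program (no control flow, as in AutoML-Zero)."""
    for opcode, out, a, b in program:
        OPS[opcode](mem, out, a, b)

# A toy Predict program: v1 = relu(m0 @ v0); s0 = v1 . v1
predict = [("mv_prod", 1, 0, 0), ("v_relu", 1, 1, None), ("v_dot", 0, 1, 1)]
mem = make_memory()
mem["v"][0] = np.ones(F)
mem["m"][0] = np.eye(F)
execute(predict, mem)  # mem["s"][0] now holds 4.0
```

An evolved algorithm is simply three such instruction lists (Setup, Predict, Learn) executed over shared memory.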
2. Evolutionary Search Paradigm
AutoML-Zero employs "regularized evolution," a population-based procedure with the following steps:
- Population management: Maintain a population of candidate algorithms, each constituted by the triple (Setup, Predict, Learn).
- Tournament selection: Sample a small subset of the population and select the lowest-loss individual as the parent.
- Variation: Child is produced by mutating the parent's code through random insertion, deletion, argument modification, or program randomization.
- Fitness evaluation: Fitness is defined as the median validation loss (or accuracy) across proxy tasks.
- Replacement: Remove the oldest individual, add the new child.
Table: Evolutionary Algorithm Workflow
| Step | Description | Implementation Note |
|---|---|---|
| Population Update | Remove oldest, append new candidate | Maintains diversity, limits stagnation |
| Tournament Select | Draw a random subset, select lowest-loss parent | Tournament size controls selection pressure |
| Variation | Random insertion, deletion, mutation of code | Promotes exploration |
| Evaluation | Median loss across proxy tasks | Handles overfitting, evaluates generality |
Efficiency enhancements include an LRU cache for behavioral fingerprinting (avoids redundant computation), "hurdles" (early stopping for clearly poor candidates), and multi-worker migration (Real et al., 2020).
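The workflow in the table reduces to a short loop. The sketch below is a minimal, generic implementation of regularized evolution (age-based removal plus tournament selection); the toy `mutate`/`evaluate` pair at the bottom is a stand-in for program mutation and proxy-task evaluation:

```python
import copy
import random
from collections import deque

random.seed(0)

def regularized_evolution(init_candidate, mutate, evaluate,
                          pop_size=100, tournament=10, cycles=1000):
    """Regularized evolution: tournament selection plus removal of the
    oldest individual. `evaluate` returns a fitness to minimize
    (e.g., median validation loss over proxy tasks)."""
    population = deque()
    for _ in range(pop_size):
        cand = mutate(copy.deepcopy(init_candidate))
        population.append((cand, evaluate(cand)))
    best = min(population, key=lambda p: p[1])
    for _ in range(cycles):
        # Tournament: sample candidates, keep the one with minimal loss.
        parent, _ = min(random.sample(list(population), tournament),
                        key=lambda p: p[1])
        child = mutate(copy.deepcopy(parent))
        fitness = evaluate(child)
        population.append((child, fitness))
        population.popleft()  # age-based removal ("regularization")
        if fitness < best[1]:
            best = (child, fitness)
    return best

# Toy stand-in: a "program" is a list of integers; loss is distance of its sum to 10.
def mutate(prog):
    prog[random.randrange(len(prog))] += random.choice([-1, 1])
    return prog

def evaluate(prog):
    return abs(sum(prog) - 10)

best_prog, best_loss = regularized_evolution([0, 0, 0], mutate, evaluate)
```

Because removal is by age rather than by worst fitness, even strong individuals eventually leave the population, which limits stagnation.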
3. Evolved Algorithmic Motifs
3.1 Rediscovery of Backpropagation
AutoML-Zero successfully produces standard two-layer neural networks trained by gradient descent from first principles. In a regression setting with a 2-layer ReLU teacher, evolution converges on canonical backpropagation: setup initializes weights and learning rates, predict computes forward passes, and learn applies error-driven weight updates matching classic gradient descent.
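The rediscovered algorithm is, in effect, standard backpropagation for a two-layer ReLU network under squared error. The sketch below mirrors that Setup/Predict/Learn decomposition; the dimensions, learning rate, and step count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
F, H = 8, 16  # feature and hidden dimensions (illustrative)

# Teacher: a fixed random two-layer ReLU network, as in the regression setting.
W1_t = rng.normal(size=(H, F))
w2_t = rng.normal(size=H)
teacher = lambda x: np.maximum(W1_t @ x, 0) @ w2_t

# Setup: small random weight initialization plus a learning rate.
W1 = rng.normal(scale=0.1, size=(H, F))
w2 = rng.normal(scale=0.1, size=H)
lr = 0.002
errs = []

for _ in range(20000):                    # single-pass stochastic training
    x = rng.normal(size=F)
    # Predict: forward pass.
    h = np.maximum(W1 @ x, 0)
    y = h @ w2
    # Learn: error-driven updates, i.e., backprop for squared error.
    err = teacher(x) - y
    g2 = err * h                          # gradient w.r.t. w2 (up to sign)
    g1 = err * np.outer(w2 * (h > 0), x)  # chain rule through the ReLU
    w2 = w2 + lr * g2
    W1 = W1 + lr * g1
    errs.append(abs(err))
```

Evolution arrives at exactly this structure: forward pass in Predict, error signal and chained weight updates in Learn.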
3.2 Emergence of Modern Techniques
On complex tasks, such as CIFAR-10 proxies, evolution led to algorithms incorporating:
- Noise injection: Random perturbations to input features, analogous to dropout, emerged under low-data regimes.
- Bilinear interactions: The top algorithm computed its prediction through a multiplicative combination of the input with learned weights and a random vector, introducing multiplicative structure absent from linear models.
- Normalized gradients: Parameter updates used the gradient divided by its norm, acting as automatic gradient clipping.
- Weight averaging: During training, models accumulated a running average of the weights, which was then used at validation time (resembling Polyak averaging).
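Two of these motifs are easy to isolate. The sketch below combines gradient normalization and weight averaging in a single linear-model update; the function names and the linear setting are illustrative assumptions (the evolved programs interleave these motifs with many other instructions):

```python
import numpy as np

rng = np.random.default_rng(1)
F = 8
w = rng.normal(scale=0.1, size=F)   # current weights
w_avg = np.zeros(F)                 # accumulated weights over training
steps = 0
lr = 0.1

def learn_step(x, y_true):
    """One update combining two evolved motifs:
    gradient normalization and weight accumulation."""
    global w, w_avg, steps
    err = y_true - w @ x
    g = err * x
    g = g / (np.linalg.norm(g) + 1e-12)  # normalized gradient ~ clipping
    w = w + lr * g
    w_avg = w_avg + w                    # running sum of weights
    steps += 1

def predict_for_validation(x):
    return (w_avg / steps) @ x           # Polyak-style averaged weights

# Toy run against a linear target.
w_true = rng.normal(size=F)
for _ in range(2000):
    x = rng.normal(size=F)
    learn_step(x, w_true @ x)
```

Normalizing the gradient bounds the step size regardless of error magnitude, and validating with the averaged weights damps the noise of the final iterates.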
3.3 Adaptations
Further evolutionary runs seeded with previous solutions yielded:
- "Noisy ReLU" as a regularizer (frequent when data is scarce)
- Learning rate decay via repeated arctan transformations (mimicking exponential decay)
- Task-specific learning rate modulation (e.g., scaling updates by a statistic of the current parameters for multiclass tasks)
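The arctan decay mechanism rests on a simple fact: arctan(x) < x for all x > 0, so repeatedly passing the learning rate through arctan shrinks it monotonically toward zero. A minimal demonstration (the starting value and number of steps are arbitrary; evolved programs embed this among other instructions):

```python
import math

lr = 0.5
schedule = [lr]
for _ in range(9):
    lr = math.atan(lr)   # arctan(x) < x for x > 0, so lr strictly decreases
    schedule.append(lr)
```

Each application contracts the rate, yielding a smooth decay schedule from a single primitive operation.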
4. Experimental Performance and Emergent Phenomena
Empirical assessment revealed:
- Sparse feasibility: For regression, only a vanishingly small fraction of random programs outperforms linear regression; nonlinearity makes the space sparser still.
- Search efficacy: Evolution greatly outperforms random search; for nonlinear tasks, random search discovers no high-quality algorithms.
- Emergent components: Modern ML features—backpropagation, bilinear forms, gradient norm clipping, dropout-like noise, Polyak averaging—appear via evolution without explicit human input.
- Resulting accuracies: On binary CIFAR-10, the best evolved algorithm (with hyperparameter retuning) outperformed both logistic regression and a two-layer MLP trained by gradient descent. Cross-dataset transfer (untuned) likewise outperformed these baselines on SVHN, Downsampled ImageNet, and Fashion MNIST.
Adaptation statistics confirm that evolutionary pressure responds interpretably to different conditions: "Noisy ReLU" emerges in $8/30$ low-data runs; learning-rate decay is present in all fast-convergence environments ($30/30$) versus $3/30$ in controls; and the "norm-trick" adaptation for multiclass appears in $24/30$ runs versus $0/30$ in controls (Real et al., 2020).
5. Methodological Strengths and Constraints
Strengths:
- Minimal human bias: The search space is constructed from only elementary mathematical operations.
- Full algorithmic discovery: Capable of identifying not just architectures, but learning rules and update mechanisms ab initio.
- Novel operation composition: Evolved solutions generate unorthodox yet effective algorithmic motifs (e.g., bilinear+weight-averaging+gradient-normalization).
- Task adaptation: Evolution automatically adjusts algorithms to new problem characteristics.
Limitations:
- Search sparsity: The fraction of algorithms that outperform simple baselines is extremely low, rendering the process computationally demanding.
- Restricted expressivity: Absence of control flow (no loops, no function calls, no higher-order tensor operations) prevents discovery of common deep learning components such as convolution, batch normalization, or RNNs.
- Hyperparameter coupling: Parameters such as learning rates are often tied to in-algorithm statistics, complicating transfer and requiring post-hoc decoupling.
- Code interpretability: Evolved programs are nontrivial to read and analyze; extracting generalized motifs necessitates additional ablation and analysis.
6. Prospects for Extension
Prominent avenues for future development include:
- Expanded search spaces: Incorporation of control flow (loops, function calls), batch operations, and high-level primitives (e.g., convolution).
- Enhanced search algorithms: Exploration of crossover, genetic programming techniques, reinforcement learning, or Bayesian optimization within this primitive-centric space.
- Automated simplification: Tools for hyperparameter decoupling and program distillation to improve transferability and interpretability.
- Self-reflexivity: Algorithms capable of modifying their own learning programs during optimization.
- Scaling: Application to larger and more complex datasets, and study of multi-task and continual learning regimes (Real et al., 2020).
In sum, AutoML-Zero empirically demonstrates that algorithmic discovery can occur directly from unbiased mathematical primitives using scalable evolutionary strategies. While the search remains computationally intensive and expressivity is currently bounded by the instruction set, this framework provides evidence supporting the feasibility of machine-driven progress in the automation of machine learning algorithm design from scratch.