AmoebaNet: Evolved Neural Architecture Search
- AmoebaNet is a family of evolved image classifiers that use a cell-based search space and aging evolution to automatically optimize architecture design.
- The models feature normal and reduction cells configured with operations like separable convolutions and poolings to construct robust computation graphs.
- Aging evolution in AmoebaNet promotes exploration and regularization, leading to architectures that rival or outperform hand-crafted designs on benchmarks like CIFAR-10 and ImageNet.
AmoebaNet refers to a family of image classifier architectures automatically discovered through a regularized (aging) evolutionary algorithm in a NASNet-style search space. AmoebaNet-A, in particular, was the first evolved neural architecture to surpass hand-crafted designs on large-scale image classification benchmarks, demonstrating state-of-the-art performance on ImageNet with superior or competitive efficiency relative to architectures discovered using reinforcement learning-guided neural architecture search (Real et al., 2018).
1. Architecture Search Space and Cell Composition
The search was conducted in the cell-based search space introduced for NASNet, where an image classifier comprises a small input stem, three stacks of identical "normal" cells, two reduction cells that halve the spatial resolution (placed after the first and second stacks), and a final pooling/softmax head.
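This macro-layout can be sketched in a few lines (an illustrative enumeration only; the number of normal cells per stack is left as a parameter, since the paper varies it between search and final training):

```python
def macro_layout(n_normal_per_stack):
    """List the layer sequence of the NASNet-style macro-architecture:
    stem, three stacks of normal cells, a reduction cell after the
    first and second stacks, then the pooling/softmax head."""
    layers = ["stem"]
    for stack in range(3):
        layers += ["normal_cell"] * n_normal_per_stack
        if stack < 2:
            layers.append("reduction_cell")  # halves spatial resolution
    layers += ["global_pool", "softmax"]
    return layers
```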
In this framework, each cell is a directed acyclic graph carrying out exactly five "pairwise combinations." Each combination selects two hidden states (with replacement), applies to each an operation from the set
- identity,
- separable convolution (3×3, 5×5, 7×7),
- average pooling 3×3,
- max pooling 3×3,
- dilated separable convolution 3×3,
- 1×7 followed by 7×1 convolution,
and then sums the two outputs to form a new hidden state.
After the five steps, there are 2 (initial inputs) + 5 (new states) = 7 hidden states; those that are never used as inputs to later combinations (typically two) are concatenated depth-wise to yield the cell's output. Normal cells preserve spatial dimensions, while reduction cells perform the same steps with stride 2.
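The bookkeeping above can be made concrete with a toy sketch (illustrative only; the stand-in operations below are scalar functions, where a real cell would apply convolutions and poolings to feature maps):

```python
def run_cell(h0, h1, steps, ops):
    """Apply five pairwise combinations. `steps` holds tuples
    (op1, src1, op2, src2) whose sources index the hidden-state list,
    which starts as [h0, h1] and grows by one state per step."""
    hidden = [h0, h1]
    used = set()
    for op1, s1, op2, s2 in steps:
        used.update((s1, s2))
        hidden.append(ops[op1](hidden[s1]) + ops[op2](hidden[s2]))
    # 2 inputs + 5 new states = 7; never-used states form the output
    return [h for i, h in enumerate(hidden) if i not in used]

# Stand-in "operations" (identity on scalars, for illustration).
ops = {"identity": lambda x: x, "sep_3x3": lambda x: x, "avg_3x3": lambda x: x}
steps = [("sep_3x3", 0, "avg_3x3", 1), ("identity", 1, "sep_3x3", 0),
         ("avg_3x3", 0, "sep_3x3", 2), ("sep_3x3", 2, "identity", 3),
         ("sep_3x3", 1, "avg_3x3", 4)]
out = run_cell(1.0, 2.0, steps, ops)  # two unused states remain
```

In a real network the returned states would be concatenated depth-wise rather than kept as a list.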
2. Regularized (Aging) Evolution Algorithm
Aging evolution maintains a population of trained models, each with an implicit age (number of cycles survived). At each evolutionary cycle:
- A sample of individuals is drawn uniformly at random (with replacement) from the population,
- The one with highest validation accuracy (the "tournament winner") is selected,
- A child architecture is produced by mutating the winner,
- The child is trained from scratch, its validation accuracy is measured and its age set to zero,
- The child is inserted into the population and the oldest individual (largest age) is removed,
- The ages of all remaining individuals are incremented by one.
This age-based culling prevents the population from being dominated by a few lucky models: it enforces continual exploration and acts as an implicit regularizer, biasing the search toward architectures that retrain well rather than those with transiently high validation scores. Standard (non-aging) evolution, by contrast, removes the lowest-accuracy model among the sampled set at each tournament.
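The per-cycle steps above can be sketched as a minimal synchronous loop (the real system runs many workers asynchronously; the three callbacks are assumptions standing in for real NAS machinery):

```python
import collections
import random

def aging_evolution(train_and_eval, random_arch, mutate,
                    cycles, population_size, sample_size, seed=0):
    """Minimal sketch of regularized (aging) evolution.
    random_arch() -> architecture; mutate(arch) -> child architecture;
    train_and_eval(arch) -> validation accuracy."""
    rng = random.Random(seed)
    population = collections.deque()  # oldest individual on the left
    history = []
    # Seed the population with random, trained architectures.
    while len(population) < population_size:
        arch = random_arch()
        population.append((arch, train_and_eval(arch)))
    # Steady-state loop: sample, select winner, mutate, age out oldest.
    for _ in range(cycles):
        sample = [rng.choice(population) for _ in range(sample_size)]
        winner_arch, _ = max(sample, key=lambda ind: ind[1])
        child = mutate(winner_arch)
        child_acc = train_and_eval(child)
        population.append((child, child_acc))  # age 0, rightmost
        population.popleft()                   # evict the oldest
        history.append((child, child_acc))
    return max(history, key=lambda ind: ind[1])
```

A toy run (architectures as numbers, "accuracy" as closeness to a target) converges in a few hundred cycles, which is the hill-climbing behavior tournament selection plus mutation is meant to produce.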
The evolution is managed asynchronously, supporting large GPU or TPU clusters with minimal resource idle time, and is controlled by only two meta-parameters: the population size and the tournament sample size.
3. Discovered Cell Topologies: AmoebaNet-A
AmoebaNet-A was identified by training and evaluating 20,000 models during the search phase on CIFAR-10, then selecting the single best as the discovered architecture. The normal and reduction cells are detailed below:
AmoebaNet-A Normal Cell:
| Step | Operation 1 | From | Operation 2 | From |
|---|---|---|---|---|
| 1 | separable conv 3×3 | h₀ | avg pool 3×3 | h₁ |
| 2 | separable conv 5×5 | h₁ | separable conv 7×7 | h₀ |
| 3 | max pool 3×3 | h₀ | separable conv 3×3 | h₂ |
| 4 | separable conv 7×7 | h₂ | separable conv 5×5 | h₃ |
| 5 | separable conv 3×3 | h₁ | max pool 3×3 | h₄ |
Unused hidden states are concatenated as the cell output.
AmoebaNet-A Reduction Cell:
| Step | Operation 1 | From | Operation 2 | From |
|---|---|---|---|---|
| 1 | separable conv 5×5 | h₀ | separable conv 3×3 | h₁ |
| 2 | avg pool 3×3 | h₂ | separable conv 7×7 | h₀ |
| 3 | separable conv 3×3 | h₃ | avg pool 3×3 | h₂ |
| 4 | max pool 3×3 | h₁ | separable conv 5×5 | h₃ |
| 5 | separable conv 7×7 | h₄ | separable conv 3×3 | h₀ |
Unused hidden states are concatenated for the output.
A compact encoding for either cell is a tuple of five two-operation steps, each recording the two source hidden states and the two operations applied to them.
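For example, the normal cell in the table above can be encoded as follows (the op names are shorthand, not identifiers from the original code):

```python
# AmoebaNet-A normal cell as five (op1, src1, op2, src2) steps.
# States 0 and 1 are the cell inputs h0/h1; each step appends one new
# state (step 1 produces state 2, step 2 produces state 3, and so on).
NORMAL_CELL = [
    ("sep_conv_3x3", 0, "avg_pool_3x3", 1),
    ("sep_conv_5x5", 1, "sep_conv_7x7", 0),
    ("max_pool_3x3", 0, "sep_conv_3x3", 2),
    ("sep_conv_7x7", 2, "sep_conv_5x5", 3),
    ("sep_conv_3x3", 1, "max_pool_3x3", 4),
]

def output_states(cell, num_inputs=2):
    """States never consumed by a later step are concatenated as output."""
    used = {src for _, s1, _, s2 in cell for src in (s1, s2)}
    return [i for i in range(num_inputs + len(cell)) if i not in used]
```

Running `output_states(NORMAL_CELL)` recovers the two unused states that form this cell's output.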
4. Hyper-parameters and Training Protocols
Search Phase
- Population size and tournament sample size are the only search meta-parameters; identity (no-op) mutation probability 0.05, otherwise equal probability of a hidden-state vs. an op mutation.
- Hidden-state mutation: pick a cell, choose one operand in one of its five combinations, and rewire that operand's source; op mutation: replace one operation with a random one from the 8-op set.
- Model size: search-phase models are kept deliberately small (few cells per stack and few filters) so that each candidate trains quickly.
- Each architecture is trained for 25 epochs on CIFAR-10 ($45,000$ train, $5,000$ validation), batch size 128, SGD momentum 0.9, learning rate schedule comparable to RL baselines.
- Search utilized 450 K40 GPUs in parallel, covering 20,000 models in 7 days, for a total of roughly 75,600 GPU-hours.
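The two mutation types can be sketched on the tuple encoding of a cell (a simplification: the real search first chooses between the model's normal and reduction cell, whereas this sketch mutates a single cell in isolation):

```python
import random

OPS = ["identity", "sep_conv_3x3", "sep_conv_5x5", "sep_conv_7x7",
       "avg_pool_3x3", "max_pool_3x3", "dil_sep_conv_3x3", "conv_1x7_7x1"]

def mutate(cell, p_identity=0.05, rng=random):
    """Apply one mutation to a cell encoded as five
    (op1, src1, op2, src2) steps. With probability p_identity do
    nothing; otherwise flip a coin between rewiring one operand's
    source (hidden-state mutation) and swapping one op (op mutation)."""
    cell = [list(step) for step in cell]
    if rng.random() >= p_identity:
        i = rng.randrange(len(cell))   # which of the five combinations
        slot = rng.choice([0, 2])      # first or second operand
        if rng.random() < 0.5:
            # Hidden-state mutation: any state that exists before step i
            cell[i][slot + 1] = rng.randrange(2 + i)
        else:
            # Op mutation: random replacement from the 8-op set
            cell[i][slot] = rng.choice(OPS)
    return [tuple(step) for step in cell]
```

Note the sourcing constraint: step i may only draw from the two inputs plus the i previously produced states, which keeps the cell a valid DAG after every mutation.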
Final Training
- After selection, architectures are scaled up in depth (number of normal cells per stack) and width (number of filters), then retrained with longer schedules and stronger regularization (ScheduledDropPath; an auxiliary softmax head with weight 0.5 on CIFAR-10 and 0.4 on ImageNet; augmentations such as Cutout and AutoAugment).
- CIFAR-10 final models: 6 cells per stack with 32 (or 36) filters (the 6×32 and 6×36 configurations); SGD with momentum 0.9, weight decay, an initial learning rate annealed with cosine decay, 600 epochs, batch size 128.
- ImageNet final models: medium (6×190, 86.7M params) and large (6×448, 469M params); optimizer: distributed RMSProp with decay 0.9, weight decay, an initial learning rate decayed by a factor of 0.97 every 2 epochs, label smoothing 0.1, batch size 1024, 350 epochs, 100 P100 GPUs.
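The two learning-rate schedules can be written out directly (a sketch; the `base_lr` values below are placeholders, since the initial rates are not reproduced here):

```python
import math

def imagenet_lr(epoch, base_lr, decay=0.97, every=2):
    """Stepwise exponential decay: multiply by `decay` once per
    `every` epochs, as in the ImageNet protocol above."""
    return base_lr * decay ** (epoch // every)

def cifar_cosine_lr(epoch, base_lr, total_epochs=600):
    """Cosine annealing from base_lr down to 0 over the full run,
    as in the CIFAR-10 protocol above."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```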
5. Empirical Performance and Cost
CIFAR-10 (with augmentation):
| Model | Params | Test Error (%) |
|---|---|---|
| NASNet-A | 3.3M | 3.41 |
| AmoebaNet-A (6×32) | 2.6M | 3.40±0.08 |
| AmoebaNet-A (6×36) | 3.2M | 3.34±0.06 |
ImageNet (single-crop):
| Model | Params | FLOPs | Top-1 / Top-5 (%) |
|---|---|---|---|
| Inception-ResNet-V2 | 55.8M | 13.2B | 80.4 / 95.3 |
| ResNeXt-101 (64×4d) | 83.6M | 31.5B | 80.9 / 95.6 |
| PolyNet | 92.0M | 34.7B | 81.3 / 95.8 |
| NASNet-A (RL) | 88.9M | 23.8B | 82.7 / 96.2 |
| PNASNet-5 | 86.1M | 25.0B | 82.9 / 96.2 |
| AmoebaNet-A (6×190) | 86.7M | 23.1B | 82.8 / 96.1 |
| AmoebaNet-A (6×448) | 469M | 104B | 83.9 / 96.6 |
Search compute cost: 20,000 model-training jobs of 25 epochs each, one K40 GPU per job, totaling 75,600 GPU-hours. Evolution reached 50% of its final accuracy in roughly half the time an RL-based NAS controller required, indicating greater efficiency, especially when compute is constrained.
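The headline total is consistent with the 450-GPU, 7-day figure reported for the search phase:

```python
# Cross-check of the reported search cost.
gpu_hours = 450 * 7 * 24        # 450 K40 GPUs for 7 days = 75,600 GPU-hours
gpu_days = gpu_hours / 24       # = 3,150 GPU-days
per_model = gpu_hours / 20_000  # average GPU-hours per evaluated model
```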
6. Analysis and Extensions
Aging evolution regularizes the evolutionary process by biasing toward candidate architectures that perform robustly upon retraining. By eliminating the oldest model at each cycle, architectures must repeatedly prove their accuracy, reducing the risk that the population is dominated by overfitted or lucky candidates.
This approach maintains exploratory diversity while rapidly exploiting high-accuracy models through tournament selection. The asynchronous, steady-state loop is well suited to distributed computing environments and requires only two meta-parameters (population size and tournament sample size), simplifying meta-optimization compared with reinforcement-learning-based methods.
Applicability was tested primarily in NASNet-style search spaces and image classification. Small-scale tests on MNIST, grayscale CIFAR-10, and a miniature ImageNet also favored aging evolution. However, efficacy on tasks outside image classification (e.g., NLP, detection) or in more expansive search spaces was not established. Aging evolution was not combined with progressive model-size training or predictor-guided search, both of which could plausibly enhance search efficiency.
Understanding which cell motifs, such as high fan-in, correlate with higher final accuracy was identified as an important direction for future work.
7. Summary and Outlook
AmoebaNet-A demonstrated that regularized evolution is a simple and effective approach for neural architecture search, generating models that matched or exceeded the accuracy, wall-clock efficiency, and scalability of RL-guided NAS models. The mechanism, search space, and training protocol developed in (Real et al., 2018) set a new precedent for leveraging evolutionary algorithms in automatic architecture discovery, offering a method with minimal controller overhead and favorable parallelization properties. Further validation on non-vision tasks and integration with other NAS accelerants remain open directions.