ScheduledDropPath: Adaptive NASNet Regularization
- ScheduledDropPath is a stochastic regularization technique for NASNet, dynamically increasing drop rates during training to stabilize feature learning.
- It employs a linear schedule from p_min=0 to p_max, with a scaling factor that preserves activation magnitudes, ensuring appropriate regularization strength.
- Empirical results on CIFAR-10 and ImageNet demonstrate that ScheduledDropPath outperforms fixed-rate DropPath, leading to significant performance gains.
ScheduledDropPath is a stochastic regularization technique designed to improve generalization in neural architectures with multi-branch cells, particularly those developed via neural architecture search such as NASNet. It extends the fixed-rate DropPath approach by modulating the probability of dropping computational paths as a function of training progress, thus addressing key deficiencies in traditional stochastic path regularization for deep, over-parameterized, multi-branch structures (Zoph et al., 2017).
1. Motivation for ScheduledDropPath
Standard DropPath regularization independently drops each computational branch in a multi-branch cell with a fixed probability p. Early empirical observations in NASNet training revealed that a constant drop rate was inadequate: small values of p led to insufficient regularization, while large values disrupted signal propagation during the critical early phases of feature learning. ScheduledDropPath was developed to provide gentle regularization during initial training, when low-level filters are forming, and progressively stronger regularization as the network's representational motifs mature. This temporal adaptation is essential for stabilizing learning and enhancing generalization in NASNet-style computational graphs (Zoph et al., 2017).
2. Mathematical Formulation
Let E denote the total number of training epochs and t the current epoch (or a suitably normalized training index). Two hyperparameters govern the schedule:
- p_min: initial drop probability at the start of training (p_min = 0 in NASNet).
- p_max: maximum drop probability, reached at the end of training (epoch E).
The drop probability is scheduled via a linear ramp:

p(t) = p_min + (p_max − p_min) · (t / E)

For each path at training epoch t, the branch output x is replaced by

x̃ = 0 with probability p(t),  x̃ = x / (1 − p(t)) otherwise.

The scaling factor 1/(1 − p(t)) preserves the expected activation magnitude, following the principle of inverted dropout.
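The expectation-preserving property of the inverted-dropout scaling can be sanity-checked with a short Monte Carlo estimate. The sketch below uses a hypothetical helper `drop_path` (not NASNet code) applied to a scalar activation:

```python
import random

def drop_path(x, p, training=True):
    """Drop a branch activation x with probability p; rescale survivors
    by 1/(1-p) so that E[output] == x (inverted-dropout scaling)."""
    if not training or p == 0.0:
        return x
    if random.random() < p:   # branch is dropped
        return 0.0
    return x / (1.0 - p)      # branch survives, rescaled

# Monte Carlo check that the expected activation magnitude is preserved
random.seed(0)
n, p, x = 200_000, 0.3, 1.0
mean = sum(drop_path(x, p) for _ in range(n)) / n  # ≈ 1.0
```

Without the 1/(1 − p) factor, the expected output would shrink to (1 − p)·x, and test-time activations (where no dropping occurs) would be systematically larger than those seen during training.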
3. Scheduling Strategy and Hyperparameters
ScheduledDropPath employs a zero-initialized ramp, setting p_min = 0 and linearly increasing the drop probability to p_max at the final epoch E. Published NASNet experiments report dataset-specific choices of p_max and E for CIFAR-10 and ImageNet (Zoph et al., 2017). A sweep over p_max showed that a moderate maximum drop rate yielded the best trade-off between regularization strength and gradient flow for NASNet architectures.
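The ramp itself is straightforward to compute. The snippet below uses illustrative values for p_max and E (not the published NASNet settings) and a hypothetical helper name `scheduled_drop_prob`:

```python
def scheduled_drop_prob(t, E, p_max, p_min=0.0):
    """Linear ramp from p_min at epoch 0 to p_max at epoch E."""
    return p_min + (p_max - p_min) * min(t / E, 1.0)

# Illustrative settings only: E = 300 epochs, p_max = 0.5
E, p_max = 300, 0.5
ramp = [scheduled_drop_prob(t, E, p_max) for t in (0, 150, 300)]
# → [0.0, 0.25, 0.5]
```

Clamping with `min(t / E, 1.0)` keeps the drop probability at p_max if training runs past epoch E, e.g. during fine-tuning.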
4. Implementation in NASNet Cells
ScheduledDropPath is applied to every distinct branch in a NASNet cell during the forward pass, using the same schedule p(t). The procedure operates as follows:
- For each input branch, draw an independent Bernoulli mask with success probability $1-p(t)$.
- If the branch is dropped, output zero; otherwise, scale activations by $1/(1-p(t))$.
- No dropping occurs at test time.
NASCellBlock Pseudocode

```
function NASCellBlock(h_a, h_b, t):
    x_a ← op_a(h_a)
    x_b ← op_b(h_b)
    p ← p_max * (t / E)              # linear ramp schedule
    keep_mask_a ← Bernoulli(1 − p)
    keep_mask_b ← Bernoulli(1 − p)
    if keep_mask_a == 0:
        x_a ← 0
    else:
        x_a ← x_a / (1 − p)
    if keep_mask_b == 0:
        x_b ← 0
    else:
        x_b ← x_b / (1 − p)
    h_out ← combine_fn(x_a, x_b)
    return h_out
```
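The pseudocode above translates directly into Python. The sketch below assumes, as the pseudocode does, that `op_a`, `op_b`, and `combine_fn` are caller-supplied operations; it is a minimal illustration rather than the NASNet implementation:

```python
import random

def nas_cell_block(h_a, h_b, t, E, p_max, op_a, op_b, combine_fn,
                   training=True):
    """One NASNet-style block with ScheduledDropPath on both branches."""
    x_a, x_b = op_a(h_a), op_b(h_b)
    p = p_max * (t / E)                  # linear ramp schedule
    if training and p > 0.0:
        # independent keep decision per branch, inverted-dropout scaling
        x_a = 0.0 if random.random() < p else x_a / (1.0 - p)
        x_b = 0.0 if random.random() < p else x_b / (1.0 - p)
    return combine_fn(x_a, x_b)

# At t = 0 the schedule gives p = 0, so the block is deterministic
out = nas_cell_block(2.0, 3.0, t=0, E=100, p_max=0.4,
                     op_a=lambda v: v + 1.0, op_b=lambda v: v * 2.0,
                     combine_fn=lambda a, b: a + b)  # → 9.0
```

Passing `training=False` (or t = 0) skips dropping entirely, matching the rule that no dropping occurs at test time.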
5. Comparison to Fixed-Rate DropPath
DropPath with a fixed probability p regularizes all training phases equally. If p is set low, NASNet’s over-capacity is insufficiently constrained; if p is high, sensitivity to dropped paths during early training can severely impair feature acquisition. ScheduledDropPath’s zero-to-high ramp allows networks to learn robust low-level filters initially, applying strong regularization only as higher-level representations solidify. Empirically, fixed-rate DropPath yields only marginal improvements or, if poorly tuned, degrades performance, whereas ScheduledDropPath consistently delivers marked generalization gains in experiments on CIFAR-10 and ImageNet (Zoph et al., 2017).
6. Empirical Performance and Ablations
Empirical validation in (Zoph et al., 2017) demonstrates the impact of ScheduledDropPath:
| Experiment | Test Error / Top-1 Acc. | Regularization |
|---|---|---|
| CIFAR-10 NASNet-A (7@2304) baseline | ~3.4% error | none |
| + Fixed-rate DropPath | ~3.25% error | moderate |
| + ScheduledDropPath | 2.97% error | strong, ramped |
| + ScheduledDropPath + Cutout | 2.40% error | state-of-the-art |
| ImageNet NASNet-A (7@1920) baseline | ~79.5% top-1 | none |
| + Fixed-rate DropPath | ~80.0% top-1 | moderate |
| + ScheduledDropPath | 80.8% top-1 | strong, ramped |
| ImageNet NASNet-A (6@4032) + ScheduledDP | 82.7% top-1 (best published) | strong, ramped |
ScheduledDropPath led to state-of-the-art CIFAR-10 and ImageNet performances, with significant reductions in computational complexity (FLOPs) relative to previous best models. The compound effect of ScheduledDropPath with other regularizers (e.g., cutout) enabled test error reductions approaching one full percentage point. A plausible implication is that ramped stochastic regularization synergizes particularly effectively with neural architecture search–based models comprising many parallel computational paths.
7. Broader Implications
ScheduledDropPath’s development was motivated by architectural search–designed convolutional models (NASNet) with deep, multi-branch cells. By providing epoch-dependent stochastic path dropping, it addresses limitations of static regularization in dynamic learning environments. Its efficacy in large-scale image recognition benchmarks established a methodological precedent for time-varying regularization in deep networks. The approach remains significant for future neural architecture search efforts, especially where the learned topologies induce overparameterization and complex inter-branch dependencies (Zoph et al., 2017).