
Automated Progressive Growth (AutoProg)

Updated 24 March 2026
  • AutoProg is a dynamic neural network training method that progressively increases model capacity through scheduled growth.
  • It employs automated growth operators to expand depth, width, and input/output dimensions while preserving performance.
  • Empirical studies show AutoProg achieves substantial computational savings with matched or improved accuracy across vision, language, and RL tasks.

Automated Progressive Growth (AutoProg) encompasses a class of neural network training frameworks in which model capacity—such as layers, width, input size, outputs, or unfreezing of parameters—is systematically and adaptively increased during the training process. The core motivation is to couple the computational efficiency of small networks in early learning with the representational power of large architectures required for final performance. By automating when, where, and how to expand a model, AutoProg obviates the need for a priori architecture selection, enables adaptation to evolving data streams or tasks, and regularizes training via capacity scheduling. AutoProg methods now span a wide array of domains including classification, large vision and language models, generative adversarial networks, reinforcement learning, and topology optimization, with state-of-the-art empirical efficiency and accuracy in each.

1. Core Principles and Canonical Algorithms

All AutoProg methodologies share several structural elements:

  • A growth schedule $\Psi = \{\psi_1, \dots, \psi_K\}$, a sequence of increasingly complex sub-networks culminating in the target architecture.
  • Growth operators $\zeta$ that specify transformations (e.g., stacking layers, widening, doubling sequence length, enlarging the output layer) and parameter initialization schemes for network expansion.
  • Automated triggers for growth based on explicit metrics (e.g., accuracy improvement, loss/return plateau, grayness, parameter gradients, runtime considerations), eliminating the need for manual intervention.
  • Safeguarding of function continuity or prior task performance at each expansion point, via identity initialization, momentum transfer, or closed-form output-layer recalibration.

Key algorithmic patterns in the literature include the “momentum growth” (MoGrow) operator for smooth parameter transfer (Li et al., 2022, Li et al., 2024), one-shot and zero-shot sub-network ranking for efficient schedule search (Li et al., 2024), automated output-layer neuron expansion for open-set classification (Venkatesan et al., 2016), and RL-specific growth criteria driven by policy improvement or gradient norms (Fehring et al., 13 Jun 2025).

A generic pseudocode schema for AutoProg is:

initialize minimal sub-network ψ_1 with parameters ω_1 = rand()
for k in 1...K-1:
    while not growth_criteria_met(ψ_k):
        train ψ_k for τ_k steps                   # optimize ω_k on the current sub-network
    candidate_set = expand_candidates(ψ_k)        # apply growth operators ζ
    ψ_{k+1} = select_optimal(candidate_set)       # via one-shot, zero-shot, or objective evaluation
    ω_{k+1} = growth_operator(ψ_k, ω_k)           # e.g., identity or momentum (MoGrow) initialization
train ψ_K until convergence

AutoProg algorithms have also been formalized as multi-objective optimization problems over loss and runtime, often relaxed via a per-stage sub-network selection with adaptive objectives (Li et al., 2022, Li et al., 2024):

\min_{\omega,\, \Psi,\, \zeta} \left\{\, \mathcal{L}(\omega; \Psi, \zeta),\ \mathcal{T}(\Psi) \,\right\}

2. Growth Operators and Architectural Manipulations

AutoProg methodologies implement a diverse suite of growth operators:

  • Depth expansion: Layer stacking via direct parameter copying or Net2Net techniques (Gu et al., 2020, Wen et al., 2019), often with identity or momentum-based parameter initialization to guarantee functional preservation.
  • Width expansion: Tiling or interpolating hidden units, FFN blocks, or attention heads (Gu et al., 2020, Li et al., 2022); a function-preserving widening sketch follows this list.
  • Input/output dimension growth: Expanding input sequence length (via pooling or truncation) or output count (e.g., for open-set classification, growing the output layer and recalibrating output weights) (Venkatesan et al., 2016).
  • Unfreezing parameters: In fine-tuning scenarios (e.g., diffusion models), AutoProg grows masking of trainable parameters rather than network size, using zero-shot trainability proxies for candidate ranking (Li et al., 2024).
  • Specialized block additions: GANs employ domain-specific blocks (e.g., fade-in upsampling layers; convolutional blocks with variable kernel/channel choice) with beam search or stochastic architecture search (Liu et al., 2021).
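
As a concrete example of such an operator, the unit-tiling form of width expansion can be made exactly function-preserving by duplicating hidden units and splitting their outgoing weights. The sketch below assumes a pair of adjacent fully connected layers in PyTorch; the function name widen_pair and the random tiling choice are illustrative assumptions, not code from the cited papers.

import torch
import torch.nn as nn

def widen_pair(fc_in: nn.Linear, fc_out: nn.Linear, new_width: int):
    # Widen the hidden layer between fc_in and fc_out to new_width while
    # leaving the composed function fc_out(relu(fc_in(x))) unchanged.
    old_width = fc_in.out_features
    # Keep every existing unit and tile randomly chosen ones to fill the gap.
    idx = torch.cat([torch.arange(old_width),
                     torch.randint(old_width, (new_width - old_width,))])
    wider_in = nn.Linear(fc_in.in_features, new_width)
    wider_out = nn.Linear(new_width, fc_out.out_features)
    with torch.no_grad():
        # Duplicated units copy the incoming weights of their source unit.
        wider_in.weight.copy_(fc_in.weight[idx])
        wider_in.bias.copy_(fc_in.bias[idx])
        # Outgoing weights are divided by each unit's replication count, so
        # the summed contribution to the next layer is exactly preserved.
        counts = torch.bincount(idx, minlength=old_width).float()
        wider_out.weight.copy_(fc_out.weight[:, idx] / counts[idx])
        wider_out.bias.copy_(fc_out.bias)
    return wider_in, wider_out

Depth growth can be made similarly lossless by adding a residual block whose final layer is zero-initialized, which is one way to realize the identity initialization mentioned above.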

Notably, the combination of aggressive early growth and schedule automation enables discovery of near-optimal depth or width for CNNs and ViTs while reducing computational cost (Wen et al., 2019, Li et al., 2022). The momentum-based growth (MoGrow) operator—where a momentum copy of past weights is maintained and interpolated into new blocks—has proven essential for lossless capacity scaling in large vision transformers (Li et al., 2022, Li et al., 2024).
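
A simplified reading of this operator is sketched below: the model tracks a momentum (exponential-moving-average) copy of its topmost block and uses that copy to initialize each newly stacked block. The class name, the EMA update, and the single-block bookkeeping are illustrative assumptions; the exact MoGrow interpolation in the cited papers may differ in detail.

import copy
import torch
import torch.nn as nn

class GrowableStack(nn.Module):
    def __init__(self, width=64, momentum=0.999):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(width, width)])
        self.momentum = momentum
        # Momentum copy of the most recently added block.
        self.ema_block = copy.deepcopy(self.blocks[-1])

    @torch.no_grad()
    def update_ema(self):
        # Call after each optimizer step to keep the momentum copy current.
        for p_ema, p in zip(self.ema_block.parameters(),
                            self.blocks[-1].parameters()):
            p_ema.mul_(self.momentum).add_(p, alpha=1 - self.momentum)

    @torch.no_grad()
    def grow(self):
        # Stack a new block initialized from the momentum copy, so training
        # resumes from weights close to an already-trained transformation
        # rather than from a random restart.
        new_block = copy.deepcopy(self.ema_block)
        self.blocks.append(new_block)
        self.ema_block = copy.deepcopy(new_block)

    def forward(self, x):
        for blk in self.blocks:
            x = torch.relu(blk(x))
        return x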

3. Growth Criteria and Schedule Search

AutoProg frameworks employ a spectrum of criteria for controlling when and how the network grows:

  • Convergent/periodic accuracy improvement thresholds: Growth is triggered if validation accuracy does not improve by a threshold $\tau$ within a window of $K$ epochs (“convergent”); alternatively, growth is forced every $K$ epochs regardless of accuracy (“periodic”), with additional stopping conditions (Wen et al., 2019). A minimal trigger sketch appears at the end of this section.
  • Return or loss plateau detection: In RL, rolling improvements in episodic return or learning curves are tracked; growth is triggered when the performance increase falls below $\delta_{R,\min}$ or when gradient norms become large (Fehring et al., 13 Jun 2025).
  • Resource-aware schedule search: Explicit balancing of task loss and wall-clock time or FLOPs under per-stage parameterizations, with objective

\min_{\psi \in \Lambda_k} \mathcal{L}(\psi) \cdot [\mathcal{T}(\psi)]^{\alpha},

where $\Lambda_k$ denotes the stage-$k$ candidate set and $\alpha$ tunes the accuracy/speed tradeoff; candidate selection is performed via one-shot (supernet-based) or zero-shot (NTK, ZiCo) proxy evaluation (Li et al., 2022, Li et al., 2024).

  • Architecture search for generative models: Candidate GAN architectures are generated via actions from a finite set (add conv in $G$/$D$, fade-in, etc.), with beam search or greedy pruning guided by downstream task metrics (e.g., FID) (Liu et al., 2021).
  • Topology optimization parameters: In non-network contexts, the adaptive tanh projection sharpness $\beta$ is increased automatically when the change in the objective function stagnates (Dunning, 2024).

In all variants, schedule search is conducted efficiently such that the total computation remains smaller than training the target model from scratch, often due to parameter reuse and one/zero-shot heuristics.
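
The sketch below combines a plateau-style trigger with the loss-times-runtime candidate score above. The window size, threshold, and the way loss and runtime are estimated (one-shot supernet evaluation, zero-shot proxy, FLOPs count) are placeholder assumptions, not the settings used in the cited papers.

from typing import Callable, Sequence

def growth_criteria_met(val_history: Sequence[float],
                        window: int = 5, tau: float = 1e-3) -> bool:
    # Trigger growth when the validation metric has improved by less than
    # tau over the last `window` evaluations (a simple plateau detector).
    if len(val_history) < window + 1:
        return False
    return (val_history[-1] - val_history[-1 - window]) < tau

def select_optimal(candidates: Sequence[object],
                   est_loss: Callable[[object], float],
                   est_time: Callable[[object], float],
                   alpha: float = 0.5) -> object:
    # Pick the candidate sub-network minimizing L(psi) * T(psi)**alpha, where
    # est_loss may be a one-shot or zero-shot proxy and est_time a FLOPs or
    # wall-clock estimate; larger alpha favors cheaper sub-networks.
    return min(candidates, key=lambda c: est_loss(c) * est_time(c) ** alpha)

Plugged into the schema of Section 1, these two functions supply the growth trigger and the per-stage sub-network selection.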

4. Theoretical Guarantees and Computational Complexity

AutoProg algorithms frequently retain closed-form or algorithmic properties of their base architectures:

  • Consistency: Output-layer-only growth (as in ELM-based AutoProg) preserves the minimum-norm least squares fit on all previously seen data, ensuring no contradiction of prior labels (Venkatesan et al., 2016); a toy illustration follows this list.
  • Universal approximation and rank: As long as the hidden subnetwork in ELM-like settings remains fixed and full-rank, growth of the output layer does not compromise representational capacity (Venkatesan et al., 2016).
  • Computational savings: On realistic benchmarks, AutoProg delivers substantial reductions in computation (e.g., up to 85.1% wall-time reduction for ViT pretraining; near 2×–3× speedup in transformer pretraining or diffusion fine-tuning) with no drop in accuracy or generative quality (Li et al., 2022, Li et al., 2024, Gu et al., 2020). In the CNN setting, a 57% speedup is also reported for ResNet-50 (Li et al., 2022).
  • Parameter update cost: Output-growth methods scale weight update cost with the accumulated number of classes, rather than the product of final output dimension and total samples (Venkatesan et al., 2016).
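
The consistency property can be illustrated with a toy NumPy example: because the hidden representation is fixed, the minimum-norm least-squares output weights decouple column by column, so appending a column for a newly discovered class leaves the previously learned columns, and hence all prior predictions, untouched. This simplified example reuses the same data for both fits and is not the recursive update used in the cited work.

import numpy as np

rng = np.random.default_rng(0)
H = np.tanh(rng.standard_normal((200, 50)))    # fixed random hidden features
labels = rng.integers(0, 3, size=200)          # three classes seen so far

def one_hot(y, num_classes):
    return np.eye(num_classes)[y]

# Minimum-norm least-squares output weights for the first three classes.
B_old = np.linalg.pinv(H) @ one_hot(labels, 3)

# A fourth class is introduced: grow the output layer by one column.
B_new = np.linalg.pinv(H) @ one_hot(labels, 4)

# Columns for the original classes, and thus their fits, are unchanged.
assert np.allclose(B_old, B_new[:, :3])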

Empirical analyses confirm that the final solution found by AutoProg typically matches or slightly outperforms static, manually tuned deep networks, making the approach “near-optimal” across model classes (Wen et al., 2019, Li et al., 2022, Venkatesan et al., 2016).

5. Applications Across Domains

AutoProg now encompasses a broad methodological landscape:

Domain | Growth Type | Application Examples
Vision (ViT) | Depth, width, patches | Pretraining/fine-tuning on ImageNet, diffusion model transfer (Li et al., 2022, Li et al., 2024)
Classification | Output-layer (class) | Online multiclass with unknown class set; real-time robotics, intent detection (Venkatesan et al., 2016)
RL | Depth, width | Adaptive policy network scaling, sample-efficient learning in MuJoCo and MiniHack (Fehring et al., 13 Jun 2025)
NLP | Depth, width, sequence | BERT pre-training; compound scaling for GLUE, SQuAD (Gu et al., 2020)
GANs | Architecture blocks | Beam search for progressive GAN growing on CIFAR, LSUN (Liu et al., 2021)
CNNs | Depth, block number | Progressive ResNet/VGG growth on CIFAR, ImageNet (Wen et al., 2019)
Topology Optimization | Projection sharpness | Achieving binary solutions in SIMP (Dunning, 2024)

These techniques support classes arriving online, scalable pretraining/fine-tuning, real-time adaptation to task drift, and efficient combinatorial architecture search.

6. Quantitative Evidence and Benchmark Results

AutoProg approaches yield domain-best or domain-parity empirical results across archetypes:

  • Vision Transformers: On ImageNet, AutoProg achieves 74.4% top-1 (DeiT-S) with 40.7% GPU time reduction; VOLO-D1 reaches 82.7% at 85.1% acceleration (Li et al., 2022, Li et al., 2024).
  • RL: AutoProg increases Ant-v3 mean return from 3200 (static net) to 3800 (+18.8%), MiniHack-Berserk by +30%, with improved sample efficiency (Fehring et al., 13 Jun 2025).
  • LLMs: CompoundGrow reduces wall time for BERT-large pretraining by 82.2% with matched or improved GLUE/SQuAD metrics (Gu et al., 2020).
  • CNNs: Discovered depths via AutoGrow enable ResNet and PlainNet to reach or exceed hand-tuned accuracy with only 1–2× the time of training a fixed network—orders of magnitude less than classical NAS (Wen et al., 2019).
  • GANs: DGGAN/AutoProg beat ProgGAN on FID by 34–58% on CIFAR-10 and LSUN at all resolutions (Liu et al., 2021).
  • Topology Optimization: AutoProg-based $\beta$ scheduling achieves binary designs 30–70% faster than hand-crafted schedules, with equivalent optimization objectives (Dunning, 2024).

7. Strengths, Limitations, and Future Directions

Strengths:

  • Universality: Applies to architectures as diverse as CNNs, transformers, GANs, ELMs, and non-neural models.
  • Efficiency: Major reductions in wall time, memory, and manual tuning required for large-scale training.
  • Robustness: Matches or betters static baselines; seamless on-the-fly adaptation to new tasks or expanding class sets.

Limitations:

  • Requires sensible growth operator engineering and hyperparameter selection (e.g., growth thresholds, one-shot search balance $\alpha$).
  • For massive class expansion events (hundreds in one chunk), output-layer-only growth may be a cost bottleneck (Venkatesan et al., 2016).
  • Over-growth can introduce redundancy, slower wall-clock per update, or require explicit stage-capping (Fehring et al., 13 Jun 2025).
  • Most approaches do not address dynamic hidden-layer growth in the ELM setting, though extensions are feasible.

Future work includes combining AutoProg with mixed-precision, gradient checkpointing, more granular growth dimensions (embedding size, MLP ratio), and scaling to ever-larger multimodal or foundation architectures (Li et al., 2024, Li et al., 2022).
