Randomized Progressive Training (RPT)
- Randomized Progressive Training (RPT) is a training paradigm where neural networks are incrementally grown using stochastic updates to optimize resource use.
- RPT algorithms employ randomized block selection, data subset sampling, and scheduled subnetwork activation to reduce computational cost.
- Empirical results show that RPT can achieve up to 10× speedups and improved generalization, supported by convergence guarantees and deep learning theory.
Randomized Progressive Training (RPT) is a family of stochastic, resource-efficient training algorithms that generalizes the empirically motivated "progressive training" paradigm and places it on a rigorous analytical footing, particularly for large-scale neural networks. RPT algorithms incrementally grow the network topology or active parameter set during training, employing randomization in data sampling, architectural subnetwork selection, or update scheduling. Key objectives include reducing wall-clock time, FLOP consumption, and sample complexity while maintaining or improving generalization relative to full-model or fully deterministic progressive approaches.
1. Formal Foundations and Algorithmic Structure
RPT extends deterministic Progressive Training (PT)—in which layers or modules are incrementally grown and optimized—to a randomized regime that admits theoretical convergence guarantees. Formally, for a parameter vector x partitioned into blocks x = (x_1, …, x_n), RPT seeks to minimize an L-smooth objective f by, at each iteration, performing a stochastic update involving only a (typically growing) randomly selected sub-block of parameters.
The general algorithm can be abstractly described as follows (Szlendak et al., 2023):
- At each step t, sample a block-sketch operator S_t (block-diagonal, selecting a prefix of blocks with given probabilities).
- Perform the parameter update x_{t+1} = x_t − γ S_t ∇f(x_t), where E[S_t] = I. This is an unbiased, block-randomized gradient descent.
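The abstract update above can be sketched on a toy least-squares problem. Everything below (problem, block sizes, prefix probabilities, step size) is a hypothetical illustration, not the paper's implementation: a random prefix of blocks is activated each step, and each active block's gradient is importance-weighted by the inverse of its activation probability so that the sketched gradient is unbiased (E[S_t] = I).

```python
import numpy as np

# Toy objective f(x) = 0.5 * ||A x - b||^2 with x split into contiguous blocks.
rng = np.random.default_rng(0)
d, n_blocks = 12, 4
blocks = np.array_split(np.arange(d), n_blocks)  # 4 blocks of 3 coordinates
A = rng.standard_normal((30, d))
b = rng.standard_normal(30)

def grad(x):
    return A.T @ (A @ x - b)

# Probability that the sampled prefix has length k; longer prefixes are
# favored so that later blocks are still updated often enough.
prefix_probs = np.array([0.1, 0.2, 0.3, 0.4])

x = np.zeros(d)
step = 1e-3
for _ in range(5000):
    k = rng.choice(n_blocks, p=prefix_probs) + 1     # prefix length in 1..4
    g = grad(x)
    for j in range(k):                               # update active blocks only
        p_active = prefix_probs[j:].sum()            # P(block j is active)
        x[blocks[j]] -= step * g[blocks[j]] / p_active

x_star = np.linalg.lstsq(A, b, rcond=None)[0]        # exact minimizer
print(np.linalg.norm(x - x_star))                    # small residual
```

Because the masking noise is proportional to the gradient, it vanishes at the optimum, so the iterates converge to the exact minimizer rather than to a noise floor.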
When instantiated in neural networks, RPT is realized in several domains:
- Incremental topology/parameter growth, optimizing only the new block/adapter at each step (Tran et al., 2020).
- Randomized subnetwork selection per batch during pretraining, as in random path or random layer-drop (Panigrahi et al., 2024, Zhuang et al., 2024).
- Block-randomized update scheduling, analyzed via coordinate descent theory (Szlendak et al., 2023).
2. Connection to Stochastic Optimization Paradigms
RPT is a special case of Randomized Coordinate Descent (RCD) or Sketched Gradient Descent (SkGD) (Szlendak et al., 2023), with the "progressive" aspect encoded as a schedule on the expected size or composition of the active parameter set. For a block decomposition with per-block smoothness constants L_i and per-block computational costs c_i, the theory prescribes sampling probabilities that trade off each block's smoothness against its cost, which provably reduces total expected computational cost. This subsumes prior deterministic progressive expansion heuristics as a limiting case.
Key theoretical results include:
- Strongly convex objectives: for a μ-strongly convex objective, RPT with a suitable step size γ converges linearly, with a contraction of the form E‖x_t − x_*‖² ≤ (1 − γμ)^t ‖x_0 − x_*‖².
- Convex and nonconvex cases: optimal rates in function value and gradient norm.
This formalization yields provable cost-accuracy trade-offs previously unavailable for progressive training.
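The expected-cost side of this trade-off reduces to a short calculation; the block costs and schedule below are hypothetical numbers chosen only to illustrate it. If block i costs c_i per iteration and a prefix of length k costs c_1 + … + c_k, the schedule's expected per-iteration cost is the probability-weighted sum over prefix lengths:

```python
def expected_prefix_cost(c, prefix_probs):
    """E[cost] = sum_k p_k * (c_1 + ... + c_k)."""
    assert abs(sum(prefix_probs) - 1.0) < 1e-9
    cum, total = 0.0, 0.0
    for c_k, p_k in zip(c, prefix_probs):
        cum += c_k                  # cost of running blocks 1..k
        total += p_k * cum
    return total

c = [1.0, 1.0, 2.0, 4.0]            # per-block costs (later blocks are wider)
full = expected_prefix_cost(c, [0.0, 0.0, 0.0, 1.0])  # always the full model
prog = expected_prefix_cost(c, [0.1, 0.2, 0.3, 0.4])  # progressive schedule
print(full, prog)                   # 8.0 vs. 4.9: ~39% expected savings
```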
3. Data and Subnetwork Randomization in Deep Learning
In deep neural networks, RPT methods introduce randomization along several axes:
- Subset Sampled Progressive Neural Network Learning (PNNL): At each progression step, train only on a random subset of the full training set (Tran et al., 2020). Three schemes are proposed:
- Uniform random sampling.
- Top-k loss selection: focus on the k samples with the highest loss under the current model.
- Clustered top-k loss: take top-loss representatives from each cluster.
- Empirically, uniform random sampling at small subset fractions outperforms the more "informed" schemes due to block diversity.
- Progressive Subnetwork/Layer Dropping (RaPTr, CopRA): At each iteration or training stage, select a random architectural subnetwork (layers, adapters) to activate and optimize (Panigrahi et al., 2024, Zhuang et al., 2024). In CopRA, each adapter is activated independently with a probability p that increases from 0 to 1 over the course of training. RaPTr progressively increases the number of active layers or submodules using stage-wise growing Bernoulli sampling per layer.
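The three subset-selection schemes above can be sketched as follows; the function names, array shapes, and toy data are hypothetical, chosen only to make the selection logic concrete:

```python
import numpy as np

def uniform_subset(n, frac, rng):
    """Uniform random sampling of a fraction of the data."""
    return rng.choice(n, size=int(frac * n), replace=False)

def top_loss_subset(losses, frac):
    """Indices of the highest-loss samples under the current model."""
    k = int(frac * len(losses))
    return np.argsort(losses)[-k:]

def clustered_top_loss_subset(losses, clusters, frac):
    """Top-loss representatives taken from each cluster separately."""
    idx = []
    for c in np.unique(clusters):
        members = np.flatnonzero(clusters == c)
        k = max(1, int(frac * len(members)))
        idx.extend(members[np.argsort(losses[members])[-k:]])
    return np.array(idx)

rng = np.random.default_rng(0)
losses = rng.random(1000)
clusters = rng.integers(0, 10, size=1000)
print(len(uniform_subset(1000, 0.1, rng)))   # 100 samples selected per step
```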
The stochastic growth and activation schedules are theoretically motivated to balance per-step cost reduction with coverage over the parameter/data space.
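A minimal sketch of such an activation schedule, with a linear ramp of the keep probability and always-on input/head layers (both assumptions for illustration, not the exact schedules of the cited papers):

```python
import numpy as np

def keep_prob(t, total_steps, p0=0.0):
    """Activation probability ramps linearly from p0 to 1 over training."""
    return min(1.0, p0 + (1.0 - p0) * t / total_steps)

def sample_active_layers(n_layers, t, total_steps, rng):
    p = keep_prob(t, total_steps)
    mask = rng.random(n_layers) < p     # independent Bernoulli(p) per layer
    mask[0] = mask[-1] = True           # input and head layers stay always-on
    return mask

rng = np.random.default_rng(0)
masks = [sample_active_layers(12, t, 1000, rng) for t in (0, 500, 1000)]
print([int(m.sum()) for m in masks])    # active-layer counts grow on average
```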
4. Empirical Results and Applications
Empirical evaluations consistently demonstrate that RPT methods achieve substantial computational savings with competitive or improved downstream performance:
- Subset Sampling in PNNL: RPT-Uniform on Caltech256 reduces per-block training time from 72.2s to 6.7s (≈10×), and end-to-end experiment time from 18.4h to 5.2h (≈3.5×). Test accuracy under subsampling matches or slightly exceeds the baseline (e.g., 80.27% vs. 79.48%) (Tran et al., 2020).
- Random Path in Transformer Pretraining: On BERT and UL2, RaPTr achieves 20–33% FLOP savings, matches baseline validation loss, and improves downstream QA and SuperGLUE metrics by up to +2% (Panigrahi et al., 2024).
- Progressive LoRA Adapters (CopRA): Merged CopRA models in federated and multi-task setups deliver higher accuracy than standard LoRA (e.g., FL–dtd: 64.07% vs. 54.37%). CopRA is robust to structured/unstructured pruning and tolerates higher learning rates (Zhuang et al., 2024).
Common best practices include scheduling the progression to cover 10–30% of the parameter/data space early, with randomization to maximize diversity and avoid overfitting.
5. Theoretical Guarantees and Stability Analysis
Several flavors of RPT come with stability and convergence analyses:
- Coordinate Descent Framework: RPT updates produce unbiased gradient estimates with cost-dependent complexity bounds (Szlendak et al., 2023). The expected per-iteration cost is the probability-weighted sum of active-block costs, and the optimal sampling probabilities minimize overall training cost subject to block smoothness and computational demands.
- Subnetwork Dropout in Transformers: The RaPTr analysis shows that when residual and LayerNorm structures are present, the loss perturbation due to stage transitions is bounded in terms of the number of layers, ensuring stable learning across growing subnetworks (Panigrahi et al., 2024).
- Shapley Value Optimization in Adapters: CopRA's random-drop schedule approximates Shapley-value regularization, encouraging each adapter to have strong marginal contribution irrespective of subnetwork (Zhuang et al., 2024). This leads to nearly linear mode connectivity, facilitating model merging.
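The Shapley-value intuition can be illustrated with a toy Monte Carlo estimate of an adapter's marginal contribution under random subnetworks; the utility function below is entirely hypothetical (a stand-in for merged-model performance), not CopRA's actual objective:

```python
import random

def utility(subset):
    """Toy utility: diminishing returns in subnetwork size, plus a bonus
    whenever adapter 0 is present."""
    return len(subset) ** 0.5 + (0.5 if 0 in subset else 0.0)

def marginal_contribution(i, others, n_samples, rng):
    """Average of v(S U {i}) - v(S) over random subsets S of the other
    adapters: a Monte Carlo, Shapley-style contribution estimate."""
    total = 0.0
    for _ in range(n_samples):
        s = frozenset(a for a in others if rng.random() < 0.5)
        total += utility(s | {i}) - utility(s)
    return total / n_samples

rng = random.Random(0)
m0 = marginal_contribution(0, [1, 2, 3], 2000, rng)
m1 = marginal_contribution(1, [0, 2, 3], 2000, rng)
print(m0, m1)   # adapter 0's estimated contribution exceeds adapter 1's
```

Random-drop training implicitly optimizes each adapter against many such random subsets, which is why every adapter ends up with a strong marginal contribution on its own.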
6. Practical Recommendations and Domain-Specific Instantiations
Recommendations across RPT variants, established via empirical study:
- Use small subset fractions (10–20%) for data subset sampling per progression step.
- Default to uniform randomization unless distributional imbalance or forgetting becomes evident (Tran et al., 2020).
- For progressive subnetworks, stagewise linear growth toward complete activation with always-on input/head layers is robust (e.g., RaPTr schedules: 6–8–10–12 on 12-layer models) (Panigrahi et al., 2024).
- Online, per-block hyperparameter selection or learning-rate decay is preferable to static global schedules.
- Fine-tuning the resulting model on the full parameter and data space can further polish generalization.
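A stagewise schedule like the 6–8–10–12 example above reduces to a small lookup; the helper below is a hypothetical sketch assuming equal-length stages:

```python
def active_layers_at(step, total_steps, stages=(6, 8, 10, 12)):
    """Split training into equal stages and grow the active-layer count."""
    stage_len = total_steps // len(stages)
    idx = min(step // stage_len, len(stages) - 1)
    return stages[idx]

print(active_layers_at(0, 1000),      # 6 layers in the first stage
      active_layers_at(600, 1000),    # 10 layers in the third stage
      active_layers_at(999, 1000))    # full 12-layer model at the end
```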
7. Extensions, Limitations, and Open Directions
RPT research connects and generalizes diverse “progressive,” “dropout,” and “block-randomized” training paradigms—encompassing subset sampling, random subnetwork activation, and cost-aware optimization. While convergence results require smoothness and (in some regimes) convexity, empirical results extend to deep, nonconvex architectures and LLM pretraining.
A plausible implication is that RPT, via induced diversity among successive blocks or layers, acts as a regularizer and may enhance model robustness in federated, multi-task, and pruning contexts (Zhuang et al., 2024).
Limitations include the need for careful schedule tuning, and reduced advantage in regimes where all parameters/data must eventually be optimized for maximal accuracy. Further work could address RPT's behavior under heavier-tailed data distributions, its asymptotic performance for very large networks, and its integration with non-uniform data and parameter access costs.
Key References:
- "Subset Sampling For Progressive Neural Network Learning" (Tran et al., 2020)
- "CopRA: A Progressive LoRA Training Strategy" (Zhuang et al., 2024)
- "Understanding Progressive Training Through the Framework of Randomized Coordinate Descent" (Szlendak et al., 2023)
- "Efficient Stagewise Pretraining via Progressive Subnetworks" (Panigrahi et al., 2024)