
Randomized Progressive Training (RPT)

Updated 24 March 2026
  • Randomized Progressive Training (RPT) is a training paradigm where neural networks are incrementally grown using stochastic updates to optimize resource use.
  • RPT algorithms employ randomized block selection, data subset sampling, and scheduled subnetwork activation to efficiently reduce computational costs.
  • Empirical results show that RPT can achieve up to 10× speedups and improved generalization, supported by convergence guarantees and deep learning theory.

Randomized Progressive Training (RPT) is a family of stochastic, resource-efficient training algorithms that generalizes, and provides rigorous analysis for, the empirically motivated "progressive training" paradigm, particularly in large-scale neural networks. RPT algorithms incrementally grow the network topology or active parameter set during training, employing randomization in data sampling, architectural subnetwork selection, or update scheduling. Key objectives include reducing wall-clock time, FLOP consumption, and sample complexity while maintaining or improving generalization compared to full-model or fully deterministic progressive approaches.

1. Formal Foundations and Algorithmic Structure

RPT extends deterministic Progressive Training (PT), in which layers or modules are incrementally grown and optimized, to a randomized regime that admits theoretical convergence guarantees. Formally, for a parameter vector $x \in \mathbb{R}^d$ segmented into blocks $x^{(1)}, \ldots, x^{(B)}$, RPT seeks to minimize an $L$-smooth objective $f: \mathbb{R}^d \to \mathbb{R}$ by, at each iteration, performing a stochastic update involving only a (typically growing) randomly selected subblock of parameters.

The general algorithm can be abstractly described as follows (Szlendak et al., 2023):

  1. At each step $k$, sample a block-sketch operator $C^k$ (block-diagonal, selecting a prefix of blocks with given probabilities).
  2. Perform a parameter update:

$x^{k+1} = x^k - \eta \, C^k \nabla f(x^k)$

where $\mathbb{E}[C^k] = I$. This is an unbiased, block-randomized gradient descent.
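The update rule above can be sketched in a few lines; a minimal NumPy toy (the block structure, prefix probabilities, and objective are illustrative assumptions, not from the cited papers):

```python
import numpy as np

def rpt_step(x, grad, blocks, prefix_probs, eta, rng):
    """One randomized progressive step: sample a prefix of parameter
    blocks, then take an importance-weighted gradient step on the
    selected coordinates so the sketch is unbiased (E[C^k] = I)."""
    k = rng.choice(len(blocks), p=prefix_probs)   # active prefix: blocks 0..k
    # P(block j is active) = sum of prefix_probs over prefixes covering j;
    # dividing block j's gradient by this probability gives E[C^k] = I.
    incl = np.cumsum(prefix_probs[::-1])[::-1]
    g = grad(x)
    x = x.copy()
    for j in range(k + 1):
        x[blocks[j]] -= eta * g[blocks[j]] / incl[j]
    return x

# Toy quadratic f(x) = 0.5 * ||x||^2, so grad(x) = x and the minimizer is 0.
rng = np.random.default_rng(0)
blocks = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
prefix_probs = np.array([0.2, 0.3, 0.5])
x = rng.normal(size=12)
start = np.linalg.norm(x)
for _ in range(500):
    x = rpt_step(x, lambda v: v, blocks, prefix_probs, eta=0.1, rng=rng)
```

Note that early blocks are contained in every prefix, so they are updated most often, matching the "progressive" bias toward already-grown parameters.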

When instantiated in neural networks, RPT is realized in several ways, detailed in the following sections.

2. Connection to Stochastic Optimization Paradigms

RPT is a special case of Randomized Coordinate Descent (RCD) or Sketched Gradient Descent (SkGD) (Szlendak et al., 2023), with the "progressive" aspect encoded as a schedule on the expected size or composition of the active parameter set. For a decomposition with per-block smoothness $L_i$ and cost $c_i$, the theory prescribes sampling probabilities $p_i \propto \sqrt{L_i/c_i}$, which provably reduces total expected computational cost. This subsumes prior deterministic progressive expansion heuristics as a limiting case.
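The cost-optimal sampling rule is straightforward to compute; a small sketch with made-up smoothness constants and costs:

```python
import numpy as np

# Hypothetical per-block smoothness constants L_i and per-block compute costs c_i.
L = np.array([4.0, 1.0, 0.25])
c = np.array([1.0, 2.0, 4.0])

# Theory-prescribed sampling probabilities: p_i proportional to sqrt(L_i / c_i).
p = np.sqrt(L / c)
p /= p.sum()

# Expected per-iteration compute under this sampling: sum_i p_i * c_i.
expected_cost = float((p * c).sum())
```

Smooth-but-cheap blocks are sampled most often, while expensive, flat blocks are visited rarely, which is exactly where the expected-cost savings come from.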

Key theoretical results include:

  • Strongly convex objectives: RPT with step size $\eta \leq 1/L_P$ satisfies

$\mathbb{E}[\|x^k - x^*\|^2] \leq (1-\mu\eta)^k \|x^0 - x^*\|^2$

  • Convex and nonconvex cases: optimal $O(1/T)$ rates in function value and gradient norm.
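For concreteness, the strongly convex rate yields an explicit iteration count; a standard one-line derivation (not specific to the cited analyses), using $(1-\mu\eta)^k \le e^{-\mu\eta k}$:

```latex
\mathbb{E}\big[\|x^k - x^*\|^2\big]
  \le (1 - \mu\eta)^k \,\|x^0 - x^*\|^2
  \le e^{-\mu\eta k} \,\|x^0 - x^*\|^2
  \le \varepsilon
\quad\text{whenever}\quad
k \ \ge\ \frac{1}{\mu\eta}\,\log\frac{\|x^0 - x^*\|^2}{\varepsilon} .
```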

This formalization yields provable cost-accuracy trade-offs previously unavailable for progressive training.

3. Data and Subnetwork Randomization in Deep Learning

In deep neural networks, RPT methods introduce randomization at various axes:

  • Subset Sampled Progressive Neural Network Learning (PNNL): At each progression step $k$, train only on a random subset $S_k \subset T$ of the full data (Tran et al., 2020). Three schemes are proposed:
    • Uniform random sampling.
    • Top-$M$-loss selection: focusing on samples with the highest loss under the current model.
    • Clustered top-$M$-loss: top-loss representatives from each cluster.
    Empirically, uniform random sampling at small fractions (e.g., $\alpha = 10\%$) outperforms the more "informed" schemes due to block diversity.
  • Progressive Subnetwork/Layer Dropping (RaPTr, CopRA): At each iteration or training stage, select a random architectural subnetwork (layers, adapters) to activate and optimize (Panigrahi et al., 2024, Zhuang et al., 2024). For CopRA, adapters are activated independently as $\delta_l \sim \mathrm{Bernoulli}(p(t))$, with $p(t)$ increasing from 0 to 1 over the course of training. RaPTr progressively increases the number of active layers or submodules using stage-wise growing Bernoulli sampling per layer.
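The PNNL subset-selection step for the first two schemes can be sketched as follows (the function name and interface are ours, not from the paper):

```python
import numpy as np

def sample_subset(losses, alpha, scheme, rng):
    """Select a training subset for one progression step.
    `losses` holds current per-sample losses; `alpha` is the subset fraction."""
    n = len(losses)
    m = max(1, int(alpha * n))
    if scheme == "uniform":
        return rng.choice(n, size=m, replace=False)
    if scheme == "top_loss":
        return np.argsort(losses)[-m:]            # the m hardest samples
    raise ValueError(f"unknown scheme: {scheme}")

rng = np.random.default_rng(1)
losses = rng.random(100)
uniform_idx = sample_subset(losses, 0.10, "uniform", rng)
hard_idx = sample_subset(losses, 0.10, "top_loss", rng)
```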
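And the CopRA-style activation draw can be sketched in a few lines (the linear schedule $p(t) = t/T$ is an illustrative choice; the description above only requires $p(t)$ to grow from 0 to 1):

```python
import numpy as np

def active_adapters(num_layers, t, T, rng):
    """Draw delta_l ~ Bernoulli(p(t)) independently per adapter, with the
    activation probability p(t) = t / T rising linearly from 0 to 1."""
    p = min(1.0, t / T)
    return rng.random(num_layers) < p

rng = np.random.default_rng(2)
early = active_adapters(12, t=10, T=1000, rng=rng)     # p = 0.01: adapters rarely active
late = active_adapters(12, t=1000, T=1000, rng=rng)    # p = 1.0: all adapters active
```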

The stochastic growth and activation schedules are theoretically motivated to balance per-step cost reduction with coverage over the parameter/data space.

4. Empirical Results and Applications

Empirical evaluations consistently demonstrate that RPT methods achieve substantial computational savings with competitive or improved downstream performance:

  • Subset Sampling in PNNL: RPT-Uniform on Caltech256 reduces per-block training time from 72.2s to 6.7s (≈10×), and end-to-end experiment time from 18.4h to 5.2h (≈3.5×). Test accuracy at $\alpha = 10\%$ matches or slightly exceeds baseline (e.g., 80.27% vs. 79.48%) (Tran et al., 2020).
  • Random Path in Transformer Pretraining: On BERT and UL2, RaPTr achieves 20–33% FLOP savings, matches baseline validation loss, and improves downstream QA and SuperGLUE metrics by up to +2% (Panigrahi et al., 2024).
  • Progressive LoRA Adapters (CopRA): Merged CopRA models in federated and multi-task setups deliver higher accuracy than standard LoRA (e.g., FL–dtd: 64.07% vs. 54.37%). CopRA is robust to structured/unstructured pruning and tolerates higher learning rates (Zhuang et al., 2024).

Common best practices include scheduling the progression to cover 10–30% of the parameter/data space early, with randomization to maximize diversity and avoid overfitting.

5. Theoretical Guarantees and Stability Analysis

Several flavors of RPT come with stability and convergence analyses:

  • Coordinate Descent Framework: RPT updates produce unbiased gradient estimates with cost-dependent complexity bounds (Szlendak et al., 2023). The expected per-iteration cost is $\sum_i p_i c_i$, with the optimal $p_i$ minimizing overall training cost subject to block smoothness and computational demands.
  • Subnetwork Dropout in Transformers: RaPTr analysis shows that if residual and LayerNorm structures are present, the loss perturbation due to stage transitions is bounded as $O(1/\sqrt{L})$ (with $L$ layers), ensuring stable learning across growing subnetworks (Panigrahi et al., 2024).
  • Shapley Value Optimization in Adapters: CopRA's random-drop schedule approximates Shapley-value regularization, encouraging each adapter to have strong marginal contribution irrespective of subnetwork (Zhuang et al., 2024). This leads to nearly linear mode connectivity, facilitating model merging.

6. Practical Recommendations and Domain-Specific Instantiations

Recommendations across RPT variants, established via empirical study:

  • Use a small $\alpha$ (10–20%) for data subset sampling per progression step.
  • Default to uniform randomization unless distributional imbalance or forgetting becomes evident (Tran et al., 2020).
  • For progressive subnetworks, stagewise linear growth toward complete activation with always-on input/head layers is robust (e.g., RaPTr schedules: 6–8–10–12 on 12-layer models) (Panigrahi et al., 2024).
  • Online, per-block hyperparameter selection or learning-rate decay is preferable to static global schedules.
  • Fine-tuning the resulting model on the full parameter and data space can further polish generalization.
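A RaPTr-style stagewise schedule like the 6–8–10–12 example above can be sketched as follows (equal stage lengths and uniform choice of the middle layers are our assumptions):

```python
import numpy as np

def raptr_active_layers(step, total_steps, depth, stages=(6, 8, 10, 12), rng=None):
    """Stagewise RaPTr-style schedule sketch: split training into equal
    stages; in each stage, activate a random subnetwork whose size is the
    stage's layer budget, always keeping the input and head layers on."""
    rng = rng or np.random.default_rng()
    stage = min(len(stages) - 1, step * len(stages) // total_steps)
    budget = stages[stage]
    always_on = {0, depth - 1}                    # input and head layers stay active
    middle = [l for l in range(depth) if l not in always_on]
    extra = rng.choice(middle, size=budget - len(always_on), replace=False)
    return sorted(always_on | set(int(l) for l in extra))

rng = np.random.default_rng(3)
early = raptr_active_layers(step=0, total_steps=1000, depth=12, rng=rng)
final = raptr_active_layers(step=999, total_steps=1000, depth=12, rng=rng)
```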

7. Extensions, Limitations, and Open Directions

RPT research connects and generalizes diverse “progressive,” “dropout,” and “block-randomized” training paradigms—encompassing subset sampling, random subnetwork activation, and cost-aware optimization. While convergence results require smoothness and (in some regimes) convexity, empirical results extend to deep, nonconvex architectures and LLM pretraining.

A plausible implication is that RPT, via induced diversity among successive blocks or layers, acts as a regularizer and may enhance model robustness in federated, multi-task, and pruning contexts (Zhuang et al., 2024).

Limitations include the need for careful schedule tuning, and reduced advantage in regimes where all parameters/data must eventually be optimized for maximal accuracy. Further work could address RPT's behavior under heavier-tailed data distributions, its asymptotic performance for very large networks, and its integration with non-uniform data and parameter access costs.


