Progressive Depth Curriculum (PDC)

Updated 16 November 2025
  • Progressive Depth Curriculum (PDC) is a curriculum learning strategy that incrementally increases model architectural depth or target complexity during training to enhance efficiency and generalization.
  • It employs varied scheduling methods—from manual thresholding to adaptive pacing—to modulate recursion depth, layer counts, and frequency components tailored to task difficulty.
  • Empirical studies across tasks like recursive reasoning, depth estimation, and language model pretraining demonstrate significant speedups, accuracy gains, and reduced computational waste with PDC.

A progressive depth curriculum (PDC) is an explicit curriculum learning strategy in which either a model’s architectural depth or the intrinsic difficulty of depth-related learning targets is increased stepwise during training. In contrast to traditional data-based curricula, which order samples from easy to hard, PDCs modulate the model’s processing or representational depth, such as recursion depth in iterative solvers, network layer count, or the complexity of frequency components in a multi-phase decoder. This paradigm has demonstrated significant efficiency and generalization benefits across multiple modalities, including recursive reasoning, monocular depth estimation, 3D semantic scene completion, and LLM pretraining.

1. Formal Definition and Scheduling Principles

PDCs instantiate a mapping from normalized training progress $\rho \in [0,1]$ to one or more architectural depth hyperparameters. In recursive models, such as the Transformer Reasoning Machine (TRM) (Qasim et al., 11 Nov 2025), these are the recursion depth parameters $(n, T)$, with $n$ “L-cycles” per “H-cycle,” $T$ “H-cycles,” and effective depth

$$\mathcal{D}_{\text{eff}}(n, T) = T \cdot (n+1) \cdot n_L,$$

where $n_L$ is the number of layers per cycle. The depth schedule is defined piecewise-constantly via curriculum thresholds $\tau_i$ and depth tuples $(n_i, T_i)$:

$$C_{\text{PDC}}(\rho) = \sum_{i=1}^{K} (n_i, T_i)\, \mathbb{1}_{[\tau_{i-1},\, \tau_i)}(\rho),$$

with $K$ curriculum stages. The same structure arises in progressive stacking of model layers (Singh et al., 13 Jun 2025), where discrete depth increments $N_i$ (layer counts) occur at prescribed training stages.
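
As a concrete check of the effective-depth formula (assuming $n_L = 2$ layers per cycle, which is consistent with the Sudoku-Extreme values quoted in Section 3), the three depth tuples used there evaluate to

$$\mathcal{D}_{\text{eff}}(2,1) = 1 \cdot 3 \cdot 2 = 6, \qquad \mathcal{D}_{\text{eff}}(4,2) = 2 \cdot 5 \cdot 2 = 20, \qquad \mathcal{D}_{\text{eff}}(6,3) = 3 \cdot 7 \cdot 2 = 42.$$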

In self-supervised monocular depth or depth fusion tasks, the curriculum is applied either over input source quality (e.g., LiDAR vs. stereo depth in CurriFlow (Lin et al., 14 Oct 2025)), over data domain difficulty (e.g., clear-to-adverse weather in WeatherDepth (Wang et al., 2023)), or over spectral/structural target components (e.g., frequency bands in DCDepth (Wang et al., 19 Oct 2024)).

The scheduling approach may be manual (fixed thresholds), adaptive (triggered by loss plateaus (Wang et al., 2023)), or computed via linear or exponential pacing (Lin et al., 14 Oct 2025). A core principle is to begin training with shallow, low-capacity, or “easier” model/data regimes to avoid early-stage overfitting or instability, and then progressively unlock greater depth or higher-difficulty targets.
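
To make the pacing options concrete, the following is a minimal Python sketch of manual, linear, and exponential pacing functions returning either the active stage index or the fraction of the full depth budget unlocked at progress $\rho$. The exact pacing formulas used in CurriFlow are not reproduced here, so the linear and exponential forms below are illustrative assumptions.

import math

def manual_pacing(rho, thresholds=(0.3, 0.6)):
    # Piecewise-constant stage index: 0, 1, ..., len(thresholds).
    return sum(rho >= tau for tau in thresholds)

def linear_pacing(rho, rho_end=0.8):
    # Fraction of the depth budget unlocked, growing linearly until rho_end.
    return min(1.0, rho / rho_end)

def exponential_pacing(rho, rate=5.0):
    # Slow early growth, rapid unlock late in training (illustrative form).
    return (math.exp(rate * rho) - 1.0) / (math.exp(rate) - 1.0)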

2. Theoretical Rationale and Empirical Motivations

The main theoretical motivations for PDC are:

  • Mitigating early-stage overfitting: At early training stages, deeper or more powerful reasoning architectures exhibit a larger generalization gap $\mathcal{R}(\theta) = \mathbb{E}_{\text{test}}[\ell(\theta)] - \mathbb{E}_{\text{train}}[\ell(\theta)]$, which is empirically proportional to $\mathcal{D}_{\text{eff}}$ for deep recursive models (Qasim et al., 11 Nov 2025).
  • Avoiding computational waste: Many targets require only limited depth/refinement. Running the full deep stack for every sample is inefficient. A PDC matches depth to learning phase and sample requirements, reducing waste (e.g., $\sim$76% in Sudoku-Extreme (Qasim et al., 11 Nov 2025)).
  • Stabilizing optimization: In fusion and completion tasks, high-quality shallow signals (e.g., LiDAR) anchor optimization, while late-phase exposure to noisier, more challenging sources (e.g., stereo) encourages robustness (Lin et al., 14 Oct 2025).
  • Regularized “global-to-local” learning: Stepped prediction in the frequency domain, from low- to high-frequency coefficients (as in DCDepth), enables the network to first capture the smooth global structure before committing capacity to local detail (Wang et al., 19 Oct 2024).

3. Algorithmic Instantiations and Implementation Details

For a $K$-stage curriculum:

  • Epochs are divided via $\rho = e/E$ (normalized epoch), with thresholds $(\tau_1, \ldots, \tau_K)$ and depth tuples $(n_i, T_i)$. For Sudoku-Extreme:
    • $\tau_1 = 0.3$, $\tau_2 = 0.6$,
    • $(n_1, T_1) = (2, 1)$, $(n_2, T_2) = (4, 2)$, $(n_3, T_3) = (6, 3)$ (i.e., $\mathcal{D}_{\text{eff}} = 6, 20, 42$).
  • At each epoch: set $(n, T) = C_{\text{PDC}}(\rho)$ and recursively unroll the model to that depth.
  • Loss is unchanged—standard deeply-supervised cross-entropy plus halting loss.

A simplified Python implementation of the curriculum scheduler is:

TAUS, DEPTHS = (0.3, 0.6), [(2, 1), (4, 2), (6, 3)]   # thresholds tau_i and depth tuples (n_i, T_i)

def get_pdc(rho):                                     # rho = e / E, normalized progress in [0, 1]
    if rho < TAUS[0]:
        return DEPTHS[0]
    elif rho < TAUS[1]:
        return DEPTHS[1]
    return DEPTHS[2]

Training proceeds with a scheduler calling this function once per epoch; the deep recursion function is then called with the current $(n, T)$.

For progressive layer stacking in LLM pretraining (Singh et al., 13 Jun 2025):

  • At stage $i$ (of $M$), the model has depth $N_i$; newly added layers are randomly initialized and briefly trained (with the old layers frozen), then all layers are unfrozen for full-stage fine-tuning (see the sketch after this list).
  • The data schedule matches the progressive depth: each stage increases the proportion of more difficult text samples, e.g., from synthetic stories (easy) to web data (hard).
  • The curriculum advances after a fixed compute budget per stage, matching total FLOPs against baselines.
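
A minimal PyTorch sketch of one stacking transition, assuming the growing stack is kept as an nn.ModuleList of generic transformer encoder layers (the actual CGLS architecture, initialization scheme, and warm-up schedule are not reproduced here):

import torch.nn as nn

def grow_stack(layers: nn.ModuleList, n_new: int, d_model: int = 512, nhead: int = 8) -> nn.ModuleList:
    # Freeze previously trained layers, then append randomly initialized new ones.
    for p in layers.parameters():
        p.requires_grad_(False)
    for _ in range(n_new):
        layers.append(nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True))
    return layers

def unfreeze_all(layers: nn.ModuleList) -> None:
    # After the brief new-layer warm-up, unfreeze everything for joint fine-tuning.
    for p in layers.parameters():
        p.requires_grad_(True)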
For curriculum-guided depth fusion in CurriFlow (Lin et al., 14 Oct 2025):

  • At training step $t$, define the fusion weight $\alpha(t) = \max(0, 1 - t/T)$ (linear decay over $T$ epochs).
  • The fused depth is $D_{\text{fused}}(t) = \alpha(t)\, D_{\text{dense}} + (1 - \alpha(t))\, D_{\text{stereo}}$, where $D_{\text{dense}}$ is LiDAR-completed depth and $D_{\text{stereo}}$ is stereo-predicted depth (see the sketch after this list).
  • Curriculum: rely entirely on $D_{\text{dense}}$ early ($\alpha = 1$), then transition to $D_{\text{stereo}}$ ($\alpha = 0$) by $T_{\text{end}} \approx 0.8$ of total training.
  • No additional regularizer is required, but a depth consistency penalty can be added.
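
A minimal sketch of this fusion rule, assuming per-pixel depth maps stored as NumPy arrays and a decay horizon t_end set to roughly 0.8 of the total number of training steps:

import numpy as np

def fused_depth(d_dense: np.ndarray, d_stereo: np.ndarray, t: int, t_end: int) -> np.ndarray:
    # Linear decay of the LiDAR-completed (dense) weight; stereo takes over by t_end.
    alpha = max(0.0, 1.0 - t / t_end)
    return alpha * d_dense + (1.0 - alpha) * d_stereo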
For progressive frequency-domain prediction in DCDepth (Wang et al., 19 Oct 2024):

  • A progressive head predicts DCT coefficients phase by phase; each phase adds a group of frequencies (DC $\to$ low AC $\to$ high AC).
  • At phase $k$, the running coefficients are $\mathcal{C}^{k} = \mathcal{C}^{k-1} + \Delta \mathcal{C}^{k}$.
  • Intermediate reconstructions are supervised with a scale-invariant log loss, with later phases weighted higher: $L_d = 10 \sum_{k=1}^{N} 0.8^{N-k}\, \mathcal{L}_{\text{si}}^{(k)}$ (see the sketch after this list).
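
The phase weighting can be written directly from the formula above; a minimal sketch, assuming si_losses is a list of already-computed per-phase scale-invariant log losses ordered from the first to the last phase:

def progressive_depth_loss(si_losses, weight_base: float = 0.8, scale: float = 10.0) -> float:
    # L_d = scale * sum_k weight_base**(N - k) * si_loss_k, so the final phase gets weight 1.
    n = len(si_losses)
    return scale * sum(weight_base ** (n - k) * loss_k
                       for k, loss_k in enumerate(si_losses, start=1))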

4. Empirical Evidence and Performance Characteristics

PDC achieves strong Pareto improvements in both efficiency and generalization across diverse tasks:

  • On Sudoku-Extreme, PDC yields a 2.26$\times$ training speedup (10.6 h $\to$ 4.70 h) and a +0.33 pp absolute accuracy gain over fixed-depth training (table below) (Qasim et al., 11 Nov 2025):
| Configuration | Time (h) | Exact (%) | Speedup |
|---|---|---|---|
| Baseline TRM | 10.60 | 85.14 | 1.00 |
| + PDC only | 4.70 | 85.47 | 2.26 |
  • In CurriFlow (Lin et al., 14 Oct 2025), curriculum-guided fusion of depth sources delivers a +0.2 mIoU gain over baseline pure-stereo training, and improves stability of early optimization as well as resilience to occlusion/texture loss.
  • In WeatherDepth (Wang et al., 2023), a data-domain PDC over weather-severity stages, combined with a contrastive curriculum, improves AbsRel error by up to 0.059 (absolute) in snow/rain and shows no forgetting in clear scenes.
  • In DCDepth (Wang et al., 19 Oct 2024), progressive DCT-phase prediction improves both global structure and edge acuity, yielding state-of-the-art Abs Rel and $\delta < 1.25$ metrics.
  • In CGLS (Singh et al., 13 Jun 2025), synchronizing progressive stack expansion with data curriculum gives +2.1 to +5 pts on zero-shot QA and +2.17% averaged across LLM benchmarks, with no extra compute.

5. Distinctions from Related Curricula and Core Mechanisms

PDC is distinguished from classic sample-based curricula (e.g., easy-to-hard ordering of data) and from purely architectural progressions that do not consider data pacing or task difficulty. Its primary mechanisms are:

  • Parametric architectural scheduling: Model capacity (e.g., recursion depth, layer count) is explicitly staged as learning matures.
  • Orthogonality to loss modification: In case studies such as (Qasim et al., 11 Nov 2025), PDC modifies only architecture; no changes to loss function or supervision weights are intrinsic to the curriculum. (Additional loss reweighting strategies, such as Hierarchical Supervision Weighting, can be optionally layered for further gains.)
  • Flexibility for hybrid designs: PDC may be combined with inter-domain curricula, contrastive losses to mitigate forgetting (Wang et al., 2023), or fused with attention-based robustness modules (Lin et al., 14 Oct 2025) for compounded effect.

Notably, curriculum-induced Pareto improvements—simultaneous acceleration and improved generalization—are rare in architectures of fixed capacity, but PDC demonstrates this phenomenon in both recursive reasoning and vision.

6. Practical Guidelines and Hyperparameter Tuning

The deployment of PDC requires judicious selection of stage thresholds, depth increments, and pacing:

  • Stage thresholds $(\tau_i)$: Early shallow phases (e.g., the first 30% of training), then moderate, then full depth. A manual grid search over a small discrete set (e.g., $\tau_1 \in \{0.2, 0.3, 0.4\}$) suffices in practice (Qasim et al., 11 Nov 2025).
  • Depth tuples $(n_i, T_i)$ or layer increments: Anchor shallow and final depths to practical domain minima and maxima; scale stage count based on available compute and problem complexity.
  • Loss and optimization: PDC does not require special loss scheduling, but monitoring the generalization gap at curriculum transitions is advised. If the gap widens, consider finer phase granularity or smoothing transitions.
  • Compute allocation: For layer stacking curricula, allocate approximately 20% of each stage's compute to new-layer initialization, 80% to joint fine-tuning (Singh et al., 13 Jun 2025).
  • Validation: Tune schedule hyperparameters by measuring wall-clock time and accuracy at interim checkpoints, maximizing FLOPs reduction under strict accuracy constraints (a minimal tuning-loop sketch follows this list).
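
A minimal sketch of such a tuning loop, assuming hypothetical user-supplied train_fn and eval_fn routines and a two-threshold schedule (the grid values and accuracy constraint below are illustrative, not prescribed by any of the cited papers):

import itertools
import time

def tune_pdc_thresholds(train_fn, eval_fn,
                        tau1_grid=(0.2, 0.3, 0.4), tau2_grid=(0.5, 0.6, 0.7),
                        min_accuracy=0.85):
    # Keep the fastest schedule whose final accuracy satisfies the constraint.
    best = None
    for tau1, tau2 in itertools.product(tau1_grid, tau2_grid):
        if tau1 >= tau2:
            continue  # curriculum stages must be ordered
        start = time.time()
        model = train_fn(taus=(tau1, tau2))
        wall_clock, acc = time.time() - start, eval_fn(model)
        if acc >= min_accuracy and (best is None or wall_clock < best[0]):
            best = (wall_clock, acc, (tau1, tau2))
    return best  # (wall-clock seconds, accuracy, (tau1, tau2)) or None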

7. Applications and Adaptations Across Domains

PDC has been successfully applied to:

  • Recursive Reasoning: Training tiny recursive models that match or approach the performance of much larger non-recursive architectures (Qasim et al., 11 Nov 2025).
  • Depth Estimation under Adverse Conditions: Robust prediction in challenging domains (e.g., rain, snow, fog) using multi-stage data and spectral curricula (Wang et al., 2023, Wang et al., 19 Oct 2024).
  • 3D Semantic Scene Completion: Fusing heterogeneous depth modalities with staged scheduling in dynamic environments (Lin et al., 14 Oct 2025).
  • LLM Pretraining: Compute- and sample-efficient stacking of transformer layers in large-scale pretraining, matched to sample/domain complexity (Singh et al., 13 Jun 2025).

A plausible implication is that, independent of modality, progressively increasing architectural or representational depth synchronized with model competence or data difficulty consistently yields efficiency, stability, and generalization benefits otherwise unavailable to static-depth or naïvely stacked models.

Summary Table: Core Mechanisms Across PDC Variants

| Domain | Depth/Progression Target | Progression Schedule | Empirical Gain |
|---|---|---|---|
| Recursive Reasoning (CGAR) | Recursion depth $(n, T)$ | 3-stage (shallow-med-full) | 2.26$\times$ speedup, +0.33 pp acc |
| Depth Estimation (WeatherDepth) | Data/weather severity | 3-stage w/ adaptive switch | -0.059 AbsRel, no forgetting |
| 3D Scene Completion (CurriFlow) | LiDAR $\to$ stereo fusion | Linear decay over $0.8E$ | +0.21 mIoU |
| LLM Pretraining (CGLS) | Transformer stack size ($N$) | 3-5 stages, data-matched | +2.1 to +5 pt zero-shot |
| Monocular Depth (DCDepth) | DCT components (frequency phases) | Global-to-local (9 phases) | SOTA Abs Rel / $\delta < 1.25$ |

In sum, a PDC organizes the growth in model capacity or target complexity as a discrete or continuous function of training progression, ensuring both computational efficiency and enhanced generalization. This framework is adaptable to diverse domains with minimal intervention and is reliably associated with strict Pareto improvements under practical constraints.
