
Curriculum-Guided Adaptive Recursion

Updated 16 November 2025
  • CGAR is a training method that dynamically adjusts recursion depth to balance computational cost and model expressivity.
  • It employs Progressive Depth Curriculum to schedule recursive parameters and Hierarchical Supervision Weighting to balance loss contributions.
  • Empirical results on Sudoku-Extreme show a 1.71× speedup and 42% cost reduction with minimal accuracy drop, affirming CGAR's efficiency.

Curriculum-Guided Adaptive Recursion (CGAR) is a training methodology designed to accelerate the training of Tiny Recursive Models (TRMs) for complex reasoning tasks. Unlike classical curriculum learning, which orders data by difficulty, CGAR applies curriculum principles directly to the architectural depth of recursive models. CGAR comprises two synergistic components: Progressive Depth Curriculum (PDC), which dynamically schedules the recursion depth over the course of training, and Hierarchical Supervision Weighting (HSW), which applies exponentially decaying importance to supervision steps to balance gradient contributions. This approach enables large reductions in training cost, prevents early-stage overfitting, and improves inference efficiency, demonstrated by empirical results on the Sudoku-Extreme benchmark.

1. Formal Definition and Underlying Principles

Let a Tiny Recursive Model be parameterized by recursion depth and cycles, denoted $(n, T) \in \mathbb{N}^2$, with effective network depth

$$\mathcal{D}_{\mathrm{eff}}(n, T) = T\,(n+1)\,n_L,$$

where $n_L$ is the number of transformer layers per recursion. Curriculum-Guided Adaptive Recursion (CGAR) prescribes a dual-schedule training routine:

  • Progressive Depth Curriculum (PDC): Schedules $(n, T)$ as a function of normalized epoch progress $\rho = e/E$.
  • Hierarchical Supervision Weighting (HSW): Applies exponentially decaying weights to each of the $N_{\sup}$ deep-supervision steps in the loss.

The core principle is to order the architecture by depth—starting with shallow recursions early in training to lower FLOPs and minimize early overfitting, and gradually increasing depth as training progresses to enhance model expressivity for hard tasks.
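
As a concrete illustration, the depth formula can be evaluated directly in code. The following minimal Python sketch assumes $n_L = 2$ transformer layers per recursion, the value consistent with the effective depths (6, 20, 42) reported for the curriculum stages in Section 2; the function name `effective_depth` is illustrative, not from the paper.

def effective_depth(n: int, T: int, n_layers: int = 2) -> int:
    # D_eff(n, T) = T * (n + 1) * n_L; n_layers = 2 is an assumed value
    # consistent with the stage depths quoted in Section 2.
    return T * (n + 1) * n_layers

# Curriculum stages used in the empirical evaluation:
for n, T in [(2, 1), (4, 2), (6, 3)]:
    print((n, T), effective_depth(n, T))   # -> 6, 20, 42 layers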

2. Progressive Depth Curriculum

PDC schedules recursion parameters $(n, T)$ via a piecewise-constant function of training progress:
$$\mathcal{C}_{\mathrm{PDC}}(\rho) = \sum_{i=1}^{K} (n_i, T_i)\,\mathbf{1}_{[\tau_{i-1}, \tau_i)}(\rho), \qquad 0 = \tau_0 < \tau_1 < \cdots < \tau_K = 1,$$
where $\mathbf{1}_{[a, b)}(\cdot)$ is the indicator function. In the empirical evaluation ($K = 3$), the curriculum proceeded as follows:

Epoch Progress $\rho$ | $(n, T)$ | Effective Depth (layers)
[0, 0.3) | (2, 1) | 6
[0.3, 0.6) | (4, 2) | 20
[0.6, 1.0] | (6, 3) | 42

At each epoch $e$, $\rho = e/E$ is computed and $(n, T) = \mathcal{C}_{\mathrm{PDC}}(\rho)$ is set for all batches. Averaged over the schedule, the expected effective depth per pass is
$$\mathbb{E}_\rho[\mathcal{D}_{\mathrm{eff}}] = 0.3 \times 6 + 0.3 \times 20 + 0.4 \times 42 = 24.6,$$
representing a 41.4% reduction versus a fixed depth of 42, with a corresponding theoretical speedup of $1.71\times$.

PDC also delays exposure to deep architectures until later in training, minimizing the risk of overfitting when model parameters are still poorly initialized. The resultant schedule is:

$$(n(e), T(e)) = \begin{cases} (2, 1), & e/E \in [0, 0.3), \\ (4, 2), & e/E \in [0.3, 0.6), \\ (6, 3), & e/E \in [0.6, 1]. \end{cases}$$
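
Expressed as code, the schedule is a simple piecewise-constant lookup. The sketch below is a minimal Python rendering of $\mathcal{C}_{\mathrm{PDC}}$ under the three-stage configuration above; the name `pdc_schedule` is illustrative.

def pdc_schedule(rho: float) -> tuple[int, int]:
    # Piecewise-constant Progressive Depth Curriculum C_PDC(rho)
    # with thresholds tau_1 = 0.3 and tau_2 = 0.6.
    if rho < 0.3:
        return (2, 1)
    if rho < 0.6:
        return (4, 2)
    return (6, 3)

# Expected effective depth under the schedule:
expected_depth = 0.3 * 6 + 0.3 * 20 + 0.4 * 42   # = 24.6 layers
speedup = 42 / expected_depth                     # ~= 1.71x over fixed depth 42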

3. Hierarchical Supervision Weighting

Hierarchical Supervision Weighting replaces uniform weighting of deep-supervision steps with exponentially decaying weights:
$$w_t = \frac{\lambda^{t-1}}{Z_\lambda}, \qquad Z_\lambda = \frac{1 - \lambda^{N_{\sup}}}{1 - \lambda}, \qquad \lambda \in (0, 1),$$
where $t$ indexes the supervision step. Empirically, $\lambda = 0.7$ best balanced speed and accuracy, yielding $w \approx [0.305, 0.213, 0.149, \ldots, 0.002]^\top$ for $N_{\sup} = 16$, with a ratio $w_1 / w_{16} \approx 153\times$.

This weighting reflects the observed exponential decay of the gradient norm across supervision steps, $\|\nabla_\theta^{(t)}\| \propto \exp(-\alpha t)$ with $\alpha \approx 0.357$; note that $\exp(-0.357) \approx 0.70$, so the chosen $\lambda = 0.7$ matches the measured decay rate. HSW therefore equalizes the magnitudes of the weighted gradients, reducing gradient variance by approximately 40%, accelerating convergence, and improving optimization stability.

The cross-entropy loss with HSW is
$$\mathcal{L}_{\mathrm{HSW}}(\theta) = \frac{1}{Z_\lambda} \sum_{t=1}^{N_{\sup}} \lambda^{t-1}\, \ell_{\mathrm{CE}}\!\left(h_{\mathrm{out}}(y^{(t)}),\, y^*\right),$$
where $\ell_{\mathrm{CE}}$ is the cross-entropy between the output at step $t$ and the ground-truth label $y^*$.
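
A minimal PyTorch-style sketch of the weighting and loss follows; it implements the formulas above, with `hsw_weights`, `hsw_loss`, and `step_losses` as illustrative names rather than the paper's code.

import torch

def hsw_weights(lam: float = 0.7, n_sup: int = 16) -> torch.Tensor:
    # Normalized weights w_t = lam**(t-1) / Z_lam; the sum of the
    # unnormalized weights equals Z_lam = (1 - lam**n_sup) / (1 - lam).
    w = lam ** torch.arange(n_sup, dtype=torch.float32)
    return w / w.sum()

def hsw_loss(step_losses: torch.Tensor, lam: float = 0.7) -> torch.Tensor:
    # step_losses: per-step cross-entropy values ell_CE(h_out(y_t), y*),
    # shape (N_sup,). Returns the HSW-weighted total loss.
    w = hsw_weights(lam, step_losses.numel())
    return (w * step_losses).sum()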

4. Integrated Training Algorithm and Implementation Specifics

The combination of PDC and HSW is instantiated in the CGAR training algorithm. Compact pseudocode (Algorithm 1) reflects the epoch-level depth curriculum and the step-weighted supervision loss:

for epoch in range(1, E + 1):
    rho = epoch / E
    n, T = C_PDC(rho)                    # Progressive Depth Curriculum: pick (n, T) for this epoch
    for X, Y_true in batches(D):
        Y = Embed(X)                     # initial answer embedding
        Z = zeros_like(Y)                # initial latent reasoning state
        loss = 0
        for t in range(1, N_sup + 1):    # deep-supervision steps
            Y, Z = deep_recursion(Y, Z, X, n, T)
            logits = OutHead(Y)
            halt_p = HaltHead(Y)         # adaptive computation halting probability
            w = lambda_ ** (t - 1)       # HSW weight for step t (normalized below)
            loss += w * CE(logits, Y_true) + beta * BCE(halt_p, indicator(logits == Y_true))
            if max(halt_p) > 0.5:        # halt once the model is confident
                break
            Y, Z = detach(Y), detach(Z)  # stop-gradient except final cycle
        loss /= Z_lambda                 # HSW normalization by the partition function
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Key implementation choices:

  • Curriculum thresholds: $(\tau_1, \tau_2) = (0.3, 0.6)$.
  • Depths: $(2, 1) \to (4, 2) \to (6, 3)$.
  • HSW decay: $\lambda = 0.7$ (chosen by ablation).
  • Halting weight: $\beta = 0.5$.
  • Optimizer: AdamW, learning rate $5 \times 10^{-4}$, cosine schedule with warmup.
  • Batch size: 768 (A100 GPU), automatic mixed precision (FP16).
  • Gradient clipping at norm 1.0; gradients detached for all but last H-cycle.
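
The optimizer settings above map onto a standard PyTorch training setup; the sketch below is one plausible instantiation, with the model, warmup length, and total step count as placeholders rather than values from the paper.

import math
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the actual TRM
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# Cosine schedule with linear warmup (warmup_steps and total_steps are assumed).
warmup_steps, total_steps = 1_000, 100_000
def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

scaler = torch.cuda.amp.GradScaler()   # FP16 automatic mixed precision

# Per optimization step (inside the training loop):
#   scaler.scale(loss).backward()
#   scaler.unscale_(optimizer)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip at norm 1.0
#   scaler.step(optimizer); scaler.update(); scheduler.step()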

5. Empirical Results and Quantitative Analysis

Evaluations were performed on the Sudoku-Extreme dataset (423,168 test puzzles). Primary metrics include exact-match accuracy, token accuracy, training time, and inference efficiency. Main results are summarized below:

Metric | Baseline TRM | CGAR
Exact accuracy (%) | 86.65 | 86.02
Token accuracy (%) | 95.01 | 94.72
Training time (h) | 10.93 | 6.38
Speedup vs baseline | 1.0× | 1.71×
A100 cost ($2/hr, USD) | 21.86 | 12.76

CGAR preserves nearly all baseline accuracy (a 0.63% drop in exact accuracy) while delivering a $1.71\times$ training speedup and 42% lower training cost.
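These figures are mutually consistent: $10.93 / 6.38 \approx 1.71$ and $1 - 12.76/21.86 \approx 0.42$.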

Inference metrics for the learned halting (ACT) mechanism:

  • Halting accuracy: 100% (vs 98.5% baseline).
  • Average reasoning steps: 5.52 (roughly 6% fewer than the baseline's 5.85).

Ablation studies revealed:

  • PDC-only yields a $2.26\times$ speedup and a slight accuracy increase ($+0.33\%$), representing a rare Pareto improvement.
  • HSW-only achieves a $1.61\times$ speedup but at the cost of a significant accuracy decrease ($-6.51\%$).
  • Combined CGAR provides a $1.71\times$ speedup and recovers most of the baseline accuracy ($-2.38\%$).

Hyperparameter sensitivity for the HSW decay parameter $\lambda$ follows a U-shaped curve: values of $\lambda < 0.6$ destabilize training, while $\lambda > 0.8$ under-emphasizes early steps.

6. Practical Implications, Limitations, and Extensions

The principal outcome of CGAR is the demonstration that curriculum on architectural depth alone—not data difficulty—can yield both computational and generalization benefits for recursive neural models. PDC counteracts early overfitting by initializing training with shallow architectures, while HSW matches loss weighting to effective information content at each recursion step, reducing noise in parameter updates.

Notable limitations:

  • Experiments were conducted exclusively on Sudoku-Extreme; generalization to tasks such as ARC-AGI, GSM8K, and MATH is untested.
  • Thresholds for the curriculum phases and the decay rate $\lambda$ were manually tuned; meta-learning or automated curriculum scheduling may offer greater robustness.
  • Current schedules are epoch-based; potential exists for sample-adaptive depth schedules.

Broader impacts:

  • A 42% reduction in training cost democratizes access to recursive reasoning research by reducing hardware requirements.
  • The architectural curriculum and recursion-aware supervision principles underlying CGAR may generalize to iterative models such as diffusion networks, neural ODEs, and program synthesis.
  • Improved inference efficiency, including the learned halting behavior, is beneficial for deployment in real-time or resource-limited environments.

In sum, Curriculum-Guided Adaptive Recursion introduces a curriculum on architectural features rather than solely data, coupled with an information-theoretic supervision schedule. This enables substantial speedups, cost reductions, and, in the case of PDC alone, even net accuracy gains for recursive neural systems, providing a foundation for more efficient training of iterative reasoning architectures.
