Curriculum-Guided Adaptive Recursion
- CGAR is a training method that dynamically adjusts recursion depth to balance computational cost and model expressivity.
- It employs Progressive Depth Curriculum to schedule recursive parameters and Hierarchical Supervision Weighting to balance loss contributions.
- Empirical results on Sudoku-Extreme show a 1.71× speedup and 42% cost reduction with minimal accuracy drop, affirming CGAR's efficiency.
Curriculum-Guided Adaptive Recursion (CGAR) is a training methodology designed to accelerate the training of Tiny Recursive Models (TRMs) for complex reasoning tasks. Unlike classical curriculum learning, which orders data by difficulty, CGAR applies curriculum principles directly to the architectural depth of recursive models. CGAR comprises two synergistic components: Progressive Depth Curriculum (PDC), which dynamically schedules the recursion depth over the course of training, and Hierarchical Supervision Weighting (HSW), which applies exponentially decaying importance to supervision steps to balance gradient contributions. This approach enables large reductions in training cost, prevents early-stage overfitting, and improves inference efficiency, demonstrated by empirical results on the Sudoku-Extreme benchmark.
1. Formal Definition and Underlying Principles
Let a Tiny Recursive Model be parameterized by its recursion depth and number of cycles, denoted $(n, T)$, with effective network depth

$$D_{\mathrm{eff}} = T \cdot (n + 1) \cdot L,$$

where $L$ is the number of transformer layers per recursion. Curriculum-Guided Adaptive Recursion (CGAR) prescribes a dual-schedule training routine:
- Progressive Depth Curriculum (PDC): Schedules $(n, T)$ as a function of normalized epoch progress $\rho = e/E \in [0, 1]$.
- Hierarchical Supervision Weighting (HSW): Applies exponentially decaying weights $w_t \propto \lambda^{t-1}$ to each of the $N_{\mathrm{sup}}$ deep-supervision steps in the loss.
The core principle is to order the architecture by depth—starting with shallow recursions early in training to lower FLOPs and minimize early overfitting, and gradually increasing depth as training progresses to enhance model expressivity for hard tasks.
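As a concrete illustration of the depth accounting, the following sketch reproduces the effective depths used in Section 2, assuming the formula $D_{\mathrm{eff}} = T(n+1)L$ with $L = 2$; the helper name `effective_depth` is illustrative rather than taken from a released implementation.

```python
# Minimal sketch of the effective-depth bookkeeping, assuming
# D_eff = T * (n + 1) * L with L = 2 transformer layers per recursion.

def effective_depth(n: int, T: int, L: int = 2) -> int:
    """Unrolled depth of a TRM with n latent recursions and T reasoning cycles."""
    return T * (n + 1) * L

for n, T in [(2, 1), (4, 2), (6, 3)]:
    print((n, T), effective_depth(n, T))   # -> 6, 20, 42 layers
```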
2. Progressive Depth Curriculum
PDC schedules the recursion parameters via a piecewise-constant function of training progress,

$$(n, T)(\rho) = \sum_{i=1}^{3} (n_i, T_i)\,\mathbb{1}\!\left[\rho \in [\rho_{i-1}, \rho_i)\right],$$

where $\mathbb{1}[\cdot]$ is the indicator function, $\rho_0 = 0 < \rho_1 < \rho_2 < \rho_3 = 1$ are the phase boundaries, and $\rho$ is the normalized epoch progress. In the empirical evaluation (with $L = 2$ layers per recursion), the curriculum proceeded as follows:
| Epoch Progress $\rho$ | $(n, T)$ | Effective Depth (layers) |
|---|---|---|
| [0, 0.3) | (2, 1) | 6 |
| [0.3, 0.6) | (4, 2) | 20 |
| [0.6, 1.0] | (6, 3) | 42 |
At each epoch $e$, the progress $\rho = e/E$ is computed and $(n, T) = C_{\mathrm{PDC}}(\rho)$ is set for all batches in that epoch. Averaged over training, the expected effective depth per forward pass is

$$\mathbb{E}[D_{\mathrm{eff}}] = 0.3 \cdot 6 + 0.3 \cdot 20 + 0.4 \cdot 42 = 24.6 \text{ layers},$$

representing a 41.4% reduction versus a fixed depth of 42 layers, with a corresponding theoretical speedup of $42 / 24.6 \approx 1.71\times$.
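As an illustration, the piecewise-constant schedule and the expected-depth arithmetic above can be sketched as follows; the helper names (`PDC_PHASES`, `pdc_schedule`) are assumptions for exposition, not identifiers from a released implementation.

```python
# Sketch of the Progressive Depth Curriculum. Phase boundaries and (n, T)
# values follow the table above; names are illustrative.
PDC_PHASES = [
    (0.3, (2, 1)),   # rho in [0.0, 0.3): shallow phase
    (0.6, (4, 2)),   # rho in [0.3, 0.6): intermediate phase
    (1.0, (6, 3)),   # rho in [0.6, 1.0]: full-depth phase
]

def pdc_schedule(rho: float) -> tuple[int, int]:
    """Return (n, T) for normalized epoch progress rho in [0, 1]."""
    for upper, (n, T) in PDC_PHASES:
        if rho < upper:
            return n, T
    return PDC_PHASES[-1][1]   # rho == 1.0 falls in the final phase

# Expected effective depth under the schedule (L = 2 layers per recursion):
expected_depth = 0.3 * 6 + 0.3 * 20 + 0.4 * 42       # = 24.6 layers
print(1 - expected_depth / 42, 42 / expected_depth)  # ~0.414 reduction, ~1.71x
```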
PDC also delays exposure to deep architectures until later in training, reducing the risk of overfitting while model parameters are still poorly calibrated; the resulting schedule is the three-phase progression shown in the table above.
3. Hierarchical Supervision Weighting
Hierarchical Supervision Weighting replaces the uniform weighting of deep-supervision steps with exponentially decaying weights

$$w_t \propto \lambda^{t-1}, \qquad t = 1, \dots, N_{\mathrm{sup}},$$

where $t$ indexes the supervision step and $\lambda \in (0, 1)$ is the decay rate. An intermediate value of $\lambda$, selected by ablation, best balanced speed and accuracy, assigning the largest weight to the first supervision step and geometrically smaller weights to later ones.
This weighting mirrors the empirically observed, approximately exponential decay of gradient norms across supervision steps. By matching the supervision weights to this decay, HSW balances the weighted gradient contributions of the individual steps, reducing gradient variance by approximately 40%, accelerating convergence, and improving the stability of optimization.
The cross-entropy loss with HSW is

$$\mathcal{L}_{\mathrm{HSW}} = \frac{1}{Z_\lambda} \sum_{t=1}^{N_{\mathrm{sup}}} \lambda^{t-1}\, \mathcal{L}_{\mathrm{CE}}\!\left(\hat{y}_t, y\right), \qquad Z_\lambda = \sum_{t=1}^{N_{\mathrm{sup}}} \lambda^{t-1},$$

where $\mathcal{L}_{\mathrm{CE}}(\hat{y}_t, y)$ is the cross-entropy between the output at step $t$ and the ground-truth label $y$, and $Z_\lambda$ is the normalizing constant that appears as `Z_lambda` in Algorithm 1.
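A minimal sketch of this weighted loss, assuming the $Z_\lambda$ normalization shown in Algorithm 1 below; the default `decay` value is a placeholder, since the ablation-selected $\lambda$ is not fixed here.

```python
import torch

def hsw_loss(step_losses: list[torch.Tensor], decay: float = 0.5) -> torch.Tensor:
    """Combine per-supervision-step cross-entropies with normalized geometric weights.

    step_losses[t] holds the cross-entropy of supervision step t + 1;
    `decay` stands in for the ablation-selected lambda.
    """
    weights = [decay ** t for t in range(len(step_losses))]
    z_lambda = sum(weights)   # normalizing constant Z_lambda
    return sum((w / z_lambda) * loss for w, loss in zip(weights, step_losses))
```

With `decay < 1` the first supervision step receives the largest weight, while `decay = 1` recovers the uniform deep-supervision baseline.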
4. Integrated Training Algorithm and Implementation Specifics
The combination of PDC and HSW is instantiated in the CGAR training algorithm. Compact pseudocode (Algorithm 1) shows the epoch-level depth curriculum and the step-weighted deep-supervision loss:
```python
# Algorithm 1: CGAR training loop (pseudocode).
for epoch in range(1, E + 1):
    rho = epoch / E
    n, T = C_PDC(rho)                      # Progressive Depth Curriculum
    for X, Y_true in batches(D):
        Y = Embed(X)
        Z = zeros_like(Y)                  # initialize latent state
        loss = 0
        for t in range(1, N_sup + 1):
            Y, Z = deep_recursion(Y, Z, X, n, T)
            logits = OutHead(Y)
            halt_p = HaltHead(Y)           # adaptive computation halting
            w = lambda_ ** (t - 1)         # HSW weight for supervision step t
            loss += w * CE(logits, Y_true) + beta * BCE(halt_p, indicator(logits == Y_true))
            if max(halt_p) > 0.5:
                break
            detach(Y, Z)                   # stop-gradient except final cycle
        loss /= Z_lambda                   # normalize by partition sum for HSW
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Key implementation choices:
- Curriculum thresholds: $\rho \in \{0.3, 0.6\}$ (phase boundaries of the schedule above).
- Depths: $(n, T) \in \{(2, 1), (4, 2), (6, 3)\}$.
- HSW decay rate $\lambda$: selected by ablation.
- Halting weight $\beta$ on the BCE term of the halting head.
- Optimizer: AdamW with a cosine learning-rate schedule and warmup.
- Batch size: 768 (A100 GPU), automatic mixed precision (FP16).
- Gradient clipping at norm 1.0; gradients detached for all but last H-cycle.
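For reference, these settings can be collected into a single configuration object. This is only a sketch: the fields for $\lambda$, $\beta$, and the learning rate are placeholders, since their exact values are not reproduced in this summary.

```python
# Hypothetical hyperparameter bundle for a CGAR run; "placeholder" fields
# are not specified here and would need to be set by ablation.
from dataclasses import dataclass

@dataclass
class CGARConfig:
    curriculum_thresholds: tuple = (0.3, 0.6)    # PDC phase boundaries (rho)
    depths: tuple = ((2, 1), (4, 2), (6, 3))     # (n, T) per phase
    hsw_decay: float = 0.5                       # lambda (placeholder)
    halt_weight: float = 0.5                     # beta (placeholder)
    learning_rate: float = 1e-4                  # placeholder
    batch_size: int = 768                        # single A100, FP16 AMP
    grad_clip_norm: float = 1.0
```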
5. Empirical Results and Quantitative Analysis
Evaluations were performed on the Sudoku-Extreme dataset (423,168 test puzzles). Primary metrics include exact-match accuracy, token accuracy, training time, and inference efficiency. Main results are summarized below:
| Metric | Baseline TRM | CGAR |
|---|---|---|
| Exact accuracy (%) | 86.65 | 86.02 |
| Token accuracy (%) | 95.01 | 94.72 |
| Training time (h) | 10.93 | 6.38 |
| Speedup vs baseline | 1.0× | 1.71× |
| A100 cost (\$2/hr, USD) | 21.86 | 12.76 |
CGAR preserves nearly all baseline accuracy (a 0.63-percentage-point drop in exact-match accuracy), with roughly 42% less training time and, correspondingly, 42% lower training cost.
Inference metrics for the learned halting (ACT) mechanism:
- Halting accuracy: 100% (vs 98.5% baseline).
- Average reasoning steps: 5.52, versus 5.85 for the baseline.
Ablation studies revealed:
- PDC alone yields a training speedup together with a slight accuracy improvement over the baseline, a rare Pareto improvement.
- HSW alone also accelerates training, but at the cost of a significant drop in accuracy.
- Combined, CGAR provides the full 1.71× speedup while recovering most of the baseline accuracy (86.02% vs. 86.65% exact match).
Sensitivity to the HSW decay rate $\lambda$ follows a U-shaped curve: overly aggressive decay (small $\lambda$) destabilizes training by starving later supervision steps of signal, while values of $\lambda$ close to 1 approach uniform weighting and under-emphasize early steps.
6. Practical Implications, Limitations, and Extensions
The principal outcome of CGAR is the demonstration that curriculum on architectural depth alone—not data difficulty—can yield both computational and generalization benefits for recursive neural models. PDC counteracts early overfitting by initializing training with shallow architectures, while HSW matches loss weighting to effective information content at each recursion step, reducing noise in parameter updates.
Notable limitations:
- Experiments were exclusively on Sudoku-Extreme; the generality to tasks such as ARC-AGI, GSM8K, and MATH is untested.
- Thresholds for curriculum phases and the decay rate were manually tuned; meta-learning or automated curriculum scheduling may offer greater robustness.
- Current schedules are epoch-based; potential exists for sample-adaptive depth schedules.
Broader impacts:
- A 42% reduction in training cost democratizes access to recursive reasoning research by reducing hardware requirements.
- The architectural-curriculum and recursion-aware supervision principles underlying CGAR may generalize to other iterative models, such as diffusion networks, neural ODEs, and program synthesis systems.
- Improved inference efficiency, including on learned halting metrics, is beneficial for deployment in real-time or resource-limited environments.
In sum, Curriculum-Guided Adaptive Recursion introduces a curriculum on architectural features rather than solely data, coupled with an information-theoretic supervision schedule. This enables substantial speedups, cost reductions, and, in the case of PDC alone, even net accuracy gains for recursive neural systems, providing a foundation for more efficient training of iterative reasoning architectures.