- The paper introduces Curriculum-Guided Adaptive Recursion (CGAR), which uses a Progressive Depth Curriculum to cut training computation by approximately 41.4% and prevent early-stage overfitting.
- It pairs this with Hierarchical Supervision Weighting, which counteracts gradient decay, reducing gradient variance by roughly 40% and contributing to a 1.71 times speedup in training time.
- The approach maintains competitive accuracy with only a 0.63% drop, showing promise for efficient training in recursive reasoning models.
Accelerating Training Speed of Tiny Recursive Models via Curriculum Guided Adaptive Recursion
Introduction
The paper tackles the inefficiency of training recursive reasoning models, focusing on Tiny Recursive Models (TRMs), which perform complex reasoning tasks with very few parameters but still require substantial training time. To address this, the paper introduces Curriculum-Guided Adaptive Recursion (CGAR), a novel training methodology that applies curriculum learning to recursion depth rather than to data ordering.
Methodology
Progressive Depth Curriculum
The CGAR framework introduces a Progressive Depth Curriculum (PDC), which dynamically adjusts recursion depth during training. Unlike traditional fixed-depth training, which wastes computation and invites overfitting in the early stages, PDC gradually increases the effective recursion depth as training progresses. This is achieved through a piecewise-constant schedule that transitions between shallow, medium, and full-depth configurations. This progression prevents early-stage overfitting and reduces training computation by approximately 41.4%.
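For illustration, a piecewise-constant schedule of this kind might look like the sketch below. The breakpoints and the (n, T) recursion settings are assumed values for illustration, not the paper's exact configuration; the function matches the C_PDC interface used in the training loop later in this article.

```python
def pdc_schedule(progress: float) -> tuple[int, int]:
    """Piecewise-constant depth curriculum (illustrative breakpoints and depths).

    `progress` is the fraction of training completed (current epoch / total epochs).
    Returns (n, T): latent recursion steps and recursion cycles passed to the
    deep_recursion call in the training loop.
    """
    if progress < 0.3:      # early training: shallow recursion, cheap and regularizing
        return 2, 1
    elif progress < 0.7:    # mid training: medium depth
        return 4, 2
    else:                   # late training: full depth
        return 6, 3
```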
Hierarchical Supervision Weighting
Complementing PDC is Hierarchical Supervision Weighting (HSW), which addresses the inefficiency of weighting all supervision steps uniformly. HSW assigns exponentially decaying importance to successive supervision steps, mirroring the gradient decay observed in recursive architectures. Concentrating weight on the early, gradient-rich steps rather than the later, diminishing ones yields roughly a 40% reduction in gradient variance and faster convergence.
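A minimal sketch of the weighting itself, assuming the exponential-decay form with geometric-series normalization that appears in the training loop below (the decay factor 0.7 and six supervision steps are illustrative values):

```python
def hsw_weights(lambda_decay: float, n_sup: int) -> list[float]:
    """Exponentially decaying supervision weights, normalized to sum to 1.

    Step t (1-indexed) gets weight lambda_decay**(t - 1) / Z, where
    Z = (1 - lambda_decay**n_sup) / (1 - lambda_decay) is the geometric sum.
    Assumes 0 < lambda_decay < 1.
    """
    z = (1 - lambda_decay**n_sup) / (1 - lambda_decay)
    return [lambda_decay**(t - 1) / z for t in range(1, n_sup + 1)]

print(hsw_weights(0.7, 6))  # earlier steps dominate; later steps taper off
```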
Implementation and Results
Under controlled conditions on a single A100 GPU, CGAR achieves a 1.71 times speedup in training, reducing wall-clock time from 10.93 hours to 6.38 hours while maintaining competitive accuracy with only a 0.63% drop. A detailed comparison against a baseline TRM trained at fixed depth shows that CGAR preserves model quality while substantially improving training efficiency.
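As a quick arithmetic check on the reported figures (using only the numbers quoted above):

```python
# Reported wall-clock training times on a single A100 GPU.
baseline_hours, cgar_hours = 10.93, 6.38
print(f"speedup: {baseline_hours / cgar_hours:.2f}x")                  # ~1.71x
print(f"wall-clock reduction: {1 - cgar_hours / baseline_hours:.1%}")  # ~41.6%
```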
Code Snippet
The implementation of CGAR involves integrating dynamic recursion and hierarchical weighting into the training loop:
```python
def train_cgar(D, E, C_PDC, lambda_decay, eta, N_sup):
    theta = INIT_PARAMS()
    OPT = ADAMW(theta, lr=eta)
    # Normalization constant for the exponentially decaying supervision weights
    Z_lambda = (1 - lambda_decay**N_sup) / (1 - lambda_decay)
    for e in range(1, E + 1):
        # Progressive Depth Curriculum: recursion depth (n, T) grows with training progress
        n, T = C_PDC(e / E)
        for X, Y_true in D:
            Y = EMBED(X)
            Z = ZERO_STATE_like(Y)
            L = 0.0
            for t in range(1, N_sup + 1):
                Y, Z = deep_recursion(Y, Z, X, n, T)
                logits = OUT_HEAD(Y)
                q = SIGMOID(HALT_HEAD(Y))
                # Hierarchical Supervision Weighting: later steps get exponentially smaller weight
                w = lambda_decay**(t - 1)
                L += w * CE(logits, Y_true) + BCE(q, MATCH(logits, Y_true))
                if MAX(q) > 0.5:          # adaptive halting once the model predicts a match
                    Y, Z = DETACH(Y), DETACH(Z)
                    break
                Y, Z = DETACH(Y), DETACH(Z)  # truncate backprop between supervision steps
            loss = L / Z_lambda
            OPT.zero_grad(); loss.backward(); OPT.step()
    return theta
```
This snippet captures the adaptive recursion and hierarchical supervision mechanisms central to CGAR, expressed in PyTorch-like pseudocode; the uppercase helpers stand in for the corresponding model components and utilities.
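For context, a hypothetical invocation might look as follows; the data loader, hyperparameter values, and the pdc_schedule helper sketched earlier are illustrative assumptions rather than the paper's configuration:

```python
theta = train_cgar(
    D=train_loader,         # yields (X, Y_true) batches; assumed to exist
    E=100,                  # total training epochs (illustrative)
    C_PDC=pdc_schedule,     # depth curriculum sketched earlier
    lambda_decay=0.7,       # HSW decay factor (assumed value)
    eta=1e-4,               # learning rate (assumed value)
    N_sup=6,                # maximum supervision steps per example
)
```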
Discussion and Future Work
CGAR demonstrates the feasibility of applying curriculum learning to architectural parameters rather than to data ordering, offering significant improvements in training time and efficiency without compromising accuracy. Future work may explore automated curriculum scheduling, broader task-domain evaluations, and techniques for integrating CGAR principles into larger, pre-trained LLMs. This could reshape training paradigms across domains where logical reasoning and inference efficiency are critical.