Tiny Recursive Models (TRMs) Overview
- Tiny Recursive Models (TRMs) are neural networks that use extreme parameter sharing to create deep architectures with a minimal memory footprint.
- They achieve competitive performance in vision, reasoning, and program synthesis by recursively reusing a single block or set of parameters.
- TRMs employ adaptive training techniques such as curriculum learning and low-rank adapters to optimize the trade-off between computational efficiency and model expressiveness.
Tiny Recursive Models (TRMs) are a class of neural networks in which extreme parameter sharing enables very deep, expressive architectures with a fraction of the memory footprint of conventional deep models. By looping a single block or small set of parameters multiple times, TRMs achieve performance competitive with much larger models, particularly on complex reasoning, program synthesis, and vision tasks, while presenting unique trade-offs in compute, design, and adaptation. This entry comprehensively catalogues the design principles, architectural realizations, training algorithms, practical impact, and current frontiers associated with TRMs.
1. Core Principles and Formal Framework
TRMs are defined by the recursive or looped reuse of a core parameter block—typically a convolutional or transformer layer—over multiple computational steps or layers, minimizing the network’s parameter count while preserving or expanding effective network depth. The canonical structure maintains two key state vectors (or tensors), updated via a recursive function:
- A “latent” reasoning scratchpad, $z$, and
- An “answer” embedding, $y$,

both of which are iteratively updated using shared parameters. In transformer-based TRMs, these updates proceed according to

$$z \leftarrow f_\theta(x, y, z), \qquad y \leftarrow g_\theta(y, z),$$

where $f_\theta$ and $g_\theta$ are modes of the same transformer block differing in query/key arrangement (Asadulaev et al., 21 Nov 2025, Jolicoeur-Martineau, 6 Oct 2025). In convolutional TRMs, a single convolutional kernel $W$ is applied recursively:

$$x_{t+1} = D_t\big(\sigma(\mathrm{BN}_t(W * x_t))\big),$$

where $D_t$ is an optional downsampling and $\mathrm{BN}_t$ is iteration-specific batch normalization (Coiffier et al., 2020).
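The recursion above can be made concrete with a short sketch. The block below is an illustrative PyTorch rendering, not reference code from any cited paper: the module name `TinyRecursiveCore`, the two-layer transformer encoder, concatenation in place of the papers' query/key rearrangement, and the fixed numbers of latent and answer updates are all assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class TinyRecursiveCore(nn.Module):
    """Minimal transformer-style TRM: one tiny shared block repeatedly updates a
    latent scratchpad z and an answer embedding y over recursion steps."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # A single tiny block, reused at every recursion step (tied parameters).
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x, y, z, n_latent_steps: int = 6, n_answer_steps: int = 3):
        # x: input embedding, y: current answer, z: latent scratchpad,
        # all of shape [batch, seq_len, d_model].
        for _ in range(n_answer_steps):
            for _ in range(n_latent_steps):
                # z <- f_theta(x, y, z): refine the scratchpad given input and answer.
                z = self.block(torch.cat([x, y, z], dim=1))[:, -z.size(1):, :]
            # y <- g_theta(y, z): refine the answer from the scratchpad alone.
            y = self.block(torch.cat([y, z], dim=1))[:, : y.size(1), :]
        return y, z

# Usage: a batch of 2 puzzles, 81 tokens each (e.g. a Sudoku grid), 128-dim embeddings.
core = TinyRecursiveCore()
x, y, z = (torch.randn(2, 81, 128) for _ in range(3))
y, z = core(x, y, z)
print(y.shape, z.shape)  # torch.Size([2, 81, 128]) torch.Size([2, 81, 128])
```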
A detailed taxonomy of parameter sharing in TRMs includes:
- Fully-shared (tied) recursion: All iteration steps use the same weight matrices.
- Relaxed recursion with LoRA: Additive, low-rank adapters restore some layer-wise specificity with small memory overhead (Bae et al., 28 Oct 2024); a minimal sketch appears at the end of this section.
- Partial recursion with non-shared heads or blocks: Used for specific expressivity-stability trade-offs (Shen et al., 2021).
TRMs are distinguished from conventional deep models by achieving high effective network depth with parameter counts that are orders of magnitude smaller.
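For the relaxed-recursion variant in the taxonomy above, the sketch below shows how per-step low-rank adapters can restore some depth-wise specificity on top of a single tied weight. It is a schematic of the general idea behind (Bae et al., 28 Oct 2024), not their implementation: the module name, the adapter rank, and the choice to adapt a single linear projection are assumptions.

```python
import torch
import torch.nn as nn

class RecursiveLinearWithLoRA(nn.Module):
    """One shared weight applied at every recursion step, plus a small
    low-rank adapter (A_t, B_t) specific to each step t."""

    def __init__(self, dim: int = 256, n_steps: int = 6, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(dim, dim)   # tied across all recursion steps
        # A is zero-initialized, so the relaxed model starts exactly at the fully tied model.
        self.A = nn.ParameterList([nn.Parameter(torch.zeros(dim, rank)) for _ in range(n_steps)])
        self.B = nn.ParameterList([nn.Parameter(0.01 * torch.randn(rank, dim)) for _ in range(n_steps)])
        self.n_steps = n_steps

    def forward(self, h):
        for t in range(self.n_steps):
            # Effective weight at step t is W_shared + A_t @ B_t (a rank-`rank` correction).
            delta = h @ self.B[t].T @ self.A[t].T
            h = torch.relu(self.shared(h) + delta)
        return h

layer = RecursiveLinearWithLoRA()
out = layer(torch.randn(4, 256))
shared_params = sum(p.numel() for p in layer.shared.parameters())
lora_params = sum(p.numel() for p in list(layer.A) + list(layer.B))
# Each per-step adapter adds 2*dim*rank parameters, far fewer than the dim^2 tied weight.
print(shared_params, lora_params)
```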
2. Historical Development and Model Variants
The TRM family encompasses both vision and sequence architectures:
- ThriftyNets established parameter-thrifty CNNs for image classification, employing a single convolutional layer applied recursively and interleaved with normalization, non-linear activation, optional downsampling, and short-range residual shortcuts. These networks demonstrated 91.0% accuracy on CIFAR-10 with 40K parameters, outperforming comparably sized Tiny ResNets and DenseNets (Coiffier et al., 2020).
- Sliced Recursive Transformers (SReT) in computer vision realized deep vision transformers with recursive blocks and sliced group self-attention, achieving state-of-the-art top-1 accuracy on ImageNet with as few as 5–15M parameters (Shen et al., 2021).
- Recursive Transformers for NLP address large model compression by repeating a single transformer block, with performance further improved by relaxing parameter tying via layer- and iteration-specific LoRA modules (Bae et al., 28 Oct 2024).
- Reasoning TRMs (e.g., for ARC, Sudoku, and Maze tasks) use a tiny transformer core with recursively updated states, achieving 44.6% accuracy on ARC-AGI-1 and unmatched performance among models with less than 0.01% of the parameter count of state-of-the-art LLMs (Jolicoeur-Martineau, 6 Oct 2025).
Further variants include the introduction of curriculum learning on recursion depth to accelerate training (CGAR) (Qasim et al., 11 Nov 2025), and recursive architectures specifically for hierarchical or multi-frequency reasoning (HRM) (Jolicoeur-Martineau, 6 Oct 2025).
| Model | Domain | Recursion Mechanism | Parameter Count | Notable Benchmark |
|---|---|---|---|---|
| ThriftyNet | Vision | Conv kernel repeated | 40K–600K | 91.0% CIFAR-10 (40K) |
| SReT | Vision | Transformer w/ slices | 5.0–71.2M | 77.6%–83.7% ImageNet |
| Relaxed Recursive Transformer | NLP | Tied block + LoRA | 0.5–1.1B | >51.7% (7-benchmark avg.) |
| TRM for ARC/Sudoku/Maze | Reasoning | Tiny 2-layer Transformer | 7M | 44.6% ARC-1 (7M) |
3. Training Algorithms and Optimization
Canonical TRM training recasts deep reasoning or prediction as a sequence of recursive improvement steps, with distinct design choices for supervision:
- Deep improvement supervision (DIS): Each recursive loop is provided a target, scheduled via a discrete diffusion process, yielding a curriculum where targets interpolate between initial prediction and the true answer in Hamming space. DIS produces a single cross-entropy loss per step, enabling up to 18× reductions in training FLOPs compared to classic stepwise supervision with halting (Asadulaev et al., 21 Nov 2025).
- Curriculum-Guided Adaptive Recursion (CGAR): Training progresses through architectural phases of increasing recursion depth and introduces exponentially decaying loss weights per supervision step, matching the empirical decay of gradient magnitudes; a combined sketch follows this list. CGAR yields up to a 2.26× speedup on Sudoku-Extreme with negligible accuracy loss (Qasim et al., 11 Nov 2025).
- Classic iterative refinement: Used in the original TRM and HRM reasoning models, with nested L-cycles and outer H-cycles per sample (Jolicoeur-Martineau, 6 Oct 2025).
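The supervision schemes above can be combined into one schematic training step, sketched below. This is illustrative only: `model.initial_logits` and `model.refine` are assumed helpers, the target rule (revealing a growing fraction of correct tokens per step, a stand-in for the discrete-diffusion schedule of DIS) and the geometric decay of the per-step loss weights (in the spirit of CGAR) are simplifications rather than the published recipes.

```python
import torch
import torch.nn.functional as F

def per_step_targets(pred_tokens, true_tokens, n_steps):
    """Intermediate targets interpolating from the model's initial prediction to the
    true answer in Hamming space: at step s, a fraction (s+1)/n_steps of positions
    is forced to the correct token."""
    targets = []
    for s in range(n_steps):
        frac = (s + 1) / n_steps
        mask = torch.rand_like(true_tokens, dtype=torch.float) < frac
        targets.append(torch.where(mask, true_tokens, pred_tokens))
    return targets

def recursive_training_loss(model, x, y_true, n_steps=6, decay=0.5):
    """One cross-entropy term per recursion step, with exponentially decaying
    weights so later refinement steps contribute smaller gradients."""
    logits = model.initial_logits(x)            # assumed helper: first guess, [B, L, V]
    targets = per_step_targets(logits.argmax(-1), y_true, n_steps)
    total, weight_sum = 0.0, 0.0
    for s in range(n_steps):
        logits = model.refine(x, logits)        # assumed helper: one recursion step
        w = decay ** s                          # exponentially decaying step weight
        total += w * F.cross_entropy(logits.flatten(0, 1), targets[s].flatten())
        weight_sum += w
    return total / weight_sum
```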
Empirical findings consistently indicate that deeper recursion improves generalization on hard reasoning tasks (e.g., each additional H- or L-cycle boosts exact-grid accuracy on Sudoku), but the returns diminish and must be balanced against compute constraints.
4. Empirical Behavior and Scaling Properties
TRMs empirically rival much larger models (LLMs, classical deep nets) across vision and reasoning domains, with test-set accuracies on par with, or exceeding, those of traditional models at orders-of-magnitude lower parameter counts:
- On CIFAR-10, ThriftyNet reaches 91.0% accuracy with ≈40K parameters, outperforming Tiny ResNet and DenseNet variants below 50K parameters (Coiffier et al., 2020).
- On ARC-AGI-1, “medium” TRMs (7M) reach 44.6%–45% accuracy, outperforming Claude 3.7, o3-mini-high, and Gemini 2.5 Pro, while LLMs such as DeepSeek R1 (671B) score only 15.8% (Jolicoeur-Martineau, 6 Oct 2025).
- Recursive Gemma 1B beats same-sized vanilla models (TinyLlama/Pythia 1B) and knowledge-distilled baselines, with LoRA augmentation enabling near-recovery of full-size performance (Bae et al., 28 Oct 2024).
- In SReT, adding modest recursion and sliced group-attention increases ImageNet accuracy by up to 5.8% while reducing parameter and FLOPs requirements (Shen et al., 2021).
TRMs exhibit a characteristic trade-off between parameter thrift and heightened compute per sample when looped at high spatial or sequence resolution, a phenomenon mitigated by appropriate downsampling scheduling or sliced-group self-attention (Coiffier et al., 2020, Shen et al., 2021).
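The trade-off can be seen with simple counting. The script below is a back-of-the-envelope illustration using assumed sizes (a 3×3 kernel, 128 channels, 16 steps, 32×32 inputs), not measurements from any cited model; it compares an untied 16-layer CNN against a single convolution looped 16 times, and shows how early downsampling cuts the looped model's compute.

```python
# Parameters vs per-image compute (multiply-adds) for an untied 16-layer CNN
# versus one 3x3 convolution looped 16 times. All sizes are assumptions.
K, C, DEPTH = 3, 128, 16          # kernel size, channel width, effective depth
H = W = 32                        # input resolution

conv_params = K * K * C * C       # one 3x3 conv, C -> C channels
untied_params = DEPTH * conv_params
looped_params = conv_params       # a single tied kernel reused DEPTH times

def conv_macs(h, w):
    return K * K * C * C * h * w  # multiply-adds for one conv application

# Same spatial resolution at every step: identical compute, ~16x fewer parameters.
macs_full_res = DEPTH * conv_macs(H, W)

# Early, aggressive downsampling (halve resolution after steps 4 and 8) reduces
# the dominant cost of re-applying the shared kernel at high resolution.
macs_downsampled = (4 * conv_macs(H, W)
                    + 4 * conv_macs(H // 2, W // 2)
                    + 8 * conv_macs(H // 4, W // 4))

print(f"untied params : {untied_params / 1e6:.2f}M")
print(f"looped params : {looped_params / 1e6:.2f}M")
print(f"MACs, full-res loop    : {macs_full_res / 1e9:.2f}G")
print(f"MACs, early downsample : {macs_downsampled / 1e9:.2f}G")
```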
5. Advanced Adaptation, Curriculum, and Inference Methods
Recent TRM research exploits their unique structure for specialized training and deployment strategies:
- Test-time adaptation: Fully fine-tuning both the transformer trunk and per-augmentation task embeddings on new tasks (for instance, in the ARC Prize) is the only approach that yields competitive accuracy under strict compute budgets; LoRA or embedding-only adaptation is significantly inferior (McGovern, 4 Nov 2025).
- Per-augmentation embeddings: Essential for performance in ARC-style tasks; attempts to encode augmentations more compactly degrade results, suggesting that overparameterized embedding tables act as strong stabilizers and regularizers (McGovern, 4 Nov 2025).
- Continuous depth-wise batching: Permits dynamic GPU utilization in sequence models, achieving up to 2–3× throughput gains when combined with early exiting (Bae et al., 28 Oct 2024); see the sketch after this list.
- Discrete diffusion targets & monotonic curricula: Ensuring incremental improvement across recursion steps outperforms LLM-generated intermediate policies in difficult tasks, such as ARC (Asadulaev et al., 21 Nov 2025).
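The following sketch illustrates the depth-wise batching and early-exit ideas above. The exit criterion, the confidence threshold, and the `model.init_state` / `model.refine` / `model.readout` interfaces are assumptions, not the mechanism of (Bae et al., 28 Oct 2024): samples whose answers are already confident leave the active batch at each recursion step, so the freed capacity can be given to new work.

```python
import torch

@torch.no_grad()
def recursive_infer_with_early_exit(model, x, max_steps=12, conf_threshold=0.95):
    """Recursive inference in which confident samples leave the batch early.
    `model.init_state`, `model.refine`, and `model.readout` are assumed interfaces
    (initial state, one shared-weight recursion step, state -> per-token logits);
    the state tensor is assumed to be batch-first."""
    outputs = [None] * x.size(0)
    active = torch.arange(x.size(0))                 # indices of still-running samples
    state = model.init_state(x)

    for _ in range(max_steps):
        state = model.refine(x[active], state)       # one recursion step, active samples only
        logits = model.readout(state)                # [n_active, seq_len, vocab]
        conf = logits.softmax(-1).max(-1).values.min(-1).values  # worst-token confidence
        exit_now = conf >= conf_threshold

        for j in torch.nonzero(exit_now).flatten().tolist():
            outputs[active[j].item()] = logits[j].argmax(-1)     # finalize this sample

        keep = ~exit_now
        active, state = active[keep], state[keep]    # shrink the batch; freed slots could be
        if active.numel() == 0:                      # refilled with new queries (depth-wise batching)
            break

    if active.numel() > 0:                           # step budget exhausted: finalize the rest
        logits = model.readout(state)
        for j, i in enumerate(active.tolist()):
            outputs[i] = logits[j].argmax(-1)
    return outputs
```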
Limitations include a risk of under- or over-computation in fixed-depth schemes and sensitivity to hyperparameters governing recursion depth and supervision schedules.
6. Theoretical Analysis and Design Trade-offs
TRMs are theoretically motivated by the observation that depth-wise SVD of layer residuals allows low-rank adapters to interpolate between fully recursive and untied extremes (Bae et al., 28 Oct 2024). Parameter-count analysis in vision TRMs (ThriftyNet/SReT) demonstrates how maximal parameter factorization achieves controlled expansion in receptive field or depth with only linear scaling of batch-norm and shortcut coefficients (Coiffier et al., 2020, Shen et al., 2021).
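A minimal sketch of the depth-wise SVD observation, with assumed shapes and a simple mean-weight tying choice (an illustration of the general idea rather than the procedure of Bae et al., 28 Oct 2024): tying layers to a shared weight and factoring each layer's residual with a truncated SVD yields exactly a per-layer low-rank adapter.

```python
import torch

def tie_and_factor(layer_weights, rank=8):
    """Given per-layer weights W_1..W_L (each [out, in]), return a shared tied weight
    plus rank-`rank` factors (A_l, B_l) approximating each residual W_l - W_shared,
    so that W_l ≈ W_shared + A_l @ B_l."""
    W_shared = torch.stack(layer_weights).mean(0)          # simple tying choice: the mean
    adapters = []
    for W in layer_weights:
        U, S, Vh = torch.linalg.svd(W - W_shared, full_matrices=False)
        A = U[:, :rank] * S[:rank]                         # [out, rank], scaled by singular values
        B = Vh[:rank, :]                                    # [rank, in]
        adapters.append((A, B))
    return W_shared, adapters

# Toy check with three random "layers" of shape [64, 64].
layers = [torch.randn(64, 64) for _ in range(3)]
W_shared, adapters = tie_and_factor(layers, rank=16)
A, B = adapters[0]
err = torch.linalg.norm(layers[0] - (W_shared + A @ B)) / torch.linalg.norm(layers[0])
print(f"relative reconstruction error of layer 0 at rank 16: {err:.3f}")
```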
Critical trade-offs in TRM design include:
- Recursion depth vs. width: For fixed budgets, increasing depth must be balanced against layer expressivity (Coiffier et al., 2020).
- Parameter reuse vs. compute: While memory savings are maximal, repeated application of the same weights at high input resolution is compute-intensive unless downsampling is early and aggressive (Coiffier et al., 2020).
- Shortcut and normalization design: Inclusion of residual connections and per-loop normalization is essential for training stability and expressivity (Coiffier et al., 2020, Shen et al., 2021).
- Halting mechanisms: Learned ACT heads are empirically less important in high-performing TRMs; simple heuristics or fixed schedules suffice for most reasoning domains (Asadulaev et al., 21 Nov 2025, McGovern, 4 Nov 2025).
- Embedding scaling: In reasoning TRMs for ARC, the embedding table for task–augmentation pairs dominates the total parameter count by a factor of more than 50, raising open questions about efficient encoding (McGovern, 4 Nov 2025).
7. Open Problems and Future Extensions
Unresolved questions in TRM research include:
- Optimal recursion scheduling and curriculum mechanisms adaptive to task difficulty.
- Designing compact yet expressive augmentation and task-embedding strategies for high-variance generalized tasks.
- Extending TRMs’ advantage in low-data and compute-limited regimes to generative settings and broader program synthesis.
- Introducing discrete latent bottlenecks or richer intermediate target generators for improved curriculum and efficiency (Asadulaev et al., 21 Nov 2025).
A plausible implication is that TRMs will continue to drive advances in highly parameter-efficient agents capable of competitive abstract reasoning and rapid adaptation on modest hardware, provided future work resolves the open questions regarding embedding scaling, recursion optimality, and dynamic adaptation strategies.
References:
- Coiffier et al., 2020
- Shen et al., 2021
- Bae et al., 28 Oct 2024
- Jolicoeur-Martineau, 6 Oct 2025
- McGovern, 4 Nov 2025
- Qasim et al., 11 Nov 2025
- Asadulaev et al., 21 Nov 2025