Recursive Stem Model: Stable Operator Learning

Updated 4 July 2026

The paper introduces RSM as a stable operator learner that reframes recursive reasoning by optimizing final output correctness after iterative refinement.
RSM is defined by its dual latent states, detached training for early iterations, and independent inner and outer recursion depths, enabling scalable inference.
Key experiments on Sudoku and Maze tasks demonstrate faster training and improved accuracy compared to prior models like TRM.

Recursive Stem Model (RSM) is a recursive reasoning architecture introduced for verifier-rich, compute-heavy tasks such as Sudoku and Maze solving. It preserves a TRM-style, weight-shared recursive backbone, but changes the training contract so that the network learns a stable, depth-agnostic transition operator rather than a depth-specific supervised trajectory. In the formulation reported for RSM, hidden-state history is fully detached during training, early recursive iterations are treated as detached warm-up steps, and the loss is applied only at the final step; this is intended to make recursion itself a test-time compute dial and to permit inference at depths far beyond those used during training (Hakimi, 3 Mar 2026).

1. Origins and design objective

RSM is positioned against prior recursive reasoning models, especially Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). Those models established that small, weight-shared networks can solve difficult puzzle tasks through latent iterative refinement, but their training typically relied on deep supervision, long unrolls, or gradient schemes tied to explicit trajectory depth. The RSM formulation identifies these choices as sources of increased wall-clock cost, memory pressure, vanishing or exploding gradient exposure, and bias toward greedy intermediate behavior (Hakimi, 3 Mar 2026).

The central design move in RSM is therefore not a wholesale replacement of the recursive backbone, but a change in what the model is asked to learn. Instead of optimizing many intermediate recursion depths, RSM optimizes correctness after refinement. The paper explicitly frames the target as a stable, depth-agnostic transition operator that can be reused across many refinement steps and remain effective when iterated far beyond the training rollout length. In the authors’ own characterization, RSM is best understood as a TRM-like recursive reasoning model retrained as an operator learner (Hakimi, 3 Mar 2026).

This places RSM near a broader family of recursive latent-state reasoners. A related taxonomy later describes TRM as a special case of a Recursive Inference Machine with shared Solver and Generator backbones and identity reweighting, which helps situate RSM as a training-contract modification of the recursive-model line rather than a departure from it (Komisarczyk et al., 5 Mar 2026). A plausible implication is that RSM’s novelty lies less in adding new module types than in altering the optimization geometry of an existing recursive architecture.

2. Latent architecture and recursive computation

RSM is a two-state recursive latent dynamical system with a fast inner latent state $Z_L$ and a slower outer latent state $Z_H$ . The model operates on an input sequence $x \in \{1,\dots,V\}^S$ with target $y \in \{1,\dots,V\}^S$ . Token embeddings are written as

$e(x) = \mathrm{Embed}(x) \in \mathbb{R}^{B \times S \times d},$

and optional learned puzzle embeddings $p$ may be prepended: $\tilde e(x)= [p; e(x)] \in \mathbb{R}^{B \times (S_p+S)\times d}.$ The total sequence length is $S_{\text{tot}} = S_p + S$ , and both latent states satisfy

$Z_H, Z_L \in \mathbb{R}^{B \times S_{\text{tot}} \times d}.$

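Initialization uses persistent vectors $h_0,\ell_0$, broadcast over the batch and sequence dimensions:

$Z_H^{(0)} = h_0, \qquad Z_L^{(0)} = \ell_0.$

The shared transition module, written here as $f_\theta$, maps a latent state and a conditioning input to an updated latent state of the same shape:

$f_\theta : \mathbb{R}^{B \times S_{\text{tot}} \times d} \times \mathbb{R}^{B \times S_{\text{tot}} \times d} \to \mathbb{R}^{B \times S_{\text{tot}} \times d}.$

This shared module is the operational meaning of the paper’s “recursive stem”: one reusable backbone is repeatedly reapplied to refine latent state (Hakimi, 3 Mar 2026).

RSM has two recursive depths. The inner depth $L$ controls repeated refinement of $Z_L$; the outer depth $H$ controls repeated refinement of $Z_H$. In TRM-style form, the inner update is

$Z_L \leftarrow f_\theta\big(Z_L,\; Z_H + \tilde e(x)\big)$

and the outer update is

$Z_H \leftarrow f_\theta\big(Z_H,\; Z_L\big).$

An indexed form makes the nesting explicit. For outer step $t = 1,\dots,H$ and inner step $k = 1,\dots,L$, with $Z_L^{(0,L)} = Z_L^{(0)}$:

$Z_L^{(t,0)} = Z_L^{(t-1,L)},$

$Z_L^{(t,k)} = f_\theta\big(Z_L^{(t,k-1)},\; Z_H^{(t-1)} + \tilde e(x)\big),$

$Z_H^{(t)} = f_\theta\big(Z_H^{(t-1)},\; Z_L^{(t,L)}\big).$

Final prediction is decoded from the terminal outer state: $\hat y = \mathrm{Decode}\big(Z_H^{(H)}\big).$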
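The nesting of inner and outer refinement can be sketched in a few lines of plain Python. This is a toy illustration rather than the paper's implementation: `transition` is a hypothetical scalar stand-in for the shared stem, the states are single floats instead of tensors, and the way the inner update is conditioned on the outer state plus the input follows the TRM-style pattern as an assumption.

```python
import math


def transition(state, context):
    """Hypothetical stand-in for the shared stem: one bounded update."""
    return math.tanh(0.5 * state + 0.5 * context)


def rsm_rollout(x_embed, h0, l0, outer_depth, inner_depth):
    """Run the nested recursion; return the outer state after each outer step."""
    z_h, z_l = h0, l0
    outer_states = []
    for _ in range(outer_depth):
        for _ in range(inner_depth):
            # Inner refinement: the fast state sees the slow state and the input.
            z_l = transition(z_l, z_h + x_embed)
        # Outer refinement: the slow state sees the refined fast state.
        z_h = transition(z_h, z_l)
        outer_states.append(z_h)
    return outer_states


states = rsm_rollout(x_embed=0.3, h0=0.0, l0=0.0, outer_depth=4, inner_depth=3)
```

The same `transition` function is called at both levels, which is the weight-sharing idea in miniature; only the two loop bounds change when more or less compute is requested.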

Two backbone instantiations are described. For Maze, RSM uses a non-causal attention variant with RoPE, SwiGLU, RMSNorm, and residual connections. For Sudoku, it typically uses an MLP token-mixing variant without positional encodings. This suggests that the recursive stem is intended as a reusable computation pattern rather than a commitment to one block type (Hakimi, 3 Mar 2026).

3. Training contract: detachment, warm-up, and terminal supervision

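The most distinctive element of RSM is its training method. Hidden-state history is fully detached during training: the states carried forward between recursive steps are replaced by stop-gradient copies, $Z_H \leftarrow \mathrm{sg}(Z_H)$ and $Z_L \leftarrow \mathrm{sg}(Z_L)$, where $\mathrm{sg}(\cdot)$ denotes the stop-gradient (detach) operation. Earlier recursive iterations still occur in the forward pass, but most of them do not carry temporal gradients. The paper treats these early steps as detached warm-up. Loss is applied only at the final outer step, $\mathcal{L} = \ell(\hat y, y)$, where $\hat y$ is the prediction decoded from the terminal outer state and $\ell$ is the task loss. No auxiliary loss is placed on intermediate depths (Hakimi, 3 Mar 2026).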

This arrangement changes what the model is optimized to do. Rather than rewarding intermediate states for looking locally correct, it rewards the transition operator for improving a state that may already have undergone substantial prior refinement. The paper presents this as a way to learn a reusable local improvement operator instead of a fixed-depth path. The authors explicitly connect this to reduced wall-clock and memory cost, since training no longer requires a long backpropagation-through-time graph (Hakimi, 3 Mar 2026).
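The effect of detached warm-up with a terminal-only loss can be illustrated on a scalar toy recursion. The sketch below is not the paper's code: `step` is a hypothetical one-parameter transition, the squared-error loss is chosen only for simplicity, and the gradient is written by hand so that the previous state is treated as a constant, which is what detaching it amounts to.

```python
import math


def step(w, z, x):
    """Hypothetical one-parameter transition applied at every recursion step."""
    return math.tanh(w * z + x)


def terminal_loss_and_grad(w, x, y, z0, depth):
    """Loss at the final step only; gradient flows through that step alone."""
    z = z0
    for _ in range(depth - 1):
        z = step(w, z, x)  # warm-up: the value is kept, no gradient is tracked
    z_prev = z  # treated as a constant, mimicking a detached state
    z_final = step(w, z_prev, x)
    loss = 0.5 * (z_final - y) ** 2
    # d(loss)/dw through the final step only: chain rule on tanh(w * z_prev + x).
    grad = (z_final - y) * (1.0 - z_final ** 2) * z_prev
    return loss, grad, z_prev


loss, grad, z_prev = terminal_loss_and_grad(w=0.8, x=0.2, y=0.5, z0=0.1, depth=6)
```

Because only the final step contributes to the gradient, the backward computation here does not grow with `depth`; a full backpropagation-through-time gradient would add one term per earlier step.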

RSM also grows outer depth $H$ and inner depth $L$ independently through milestone-based schedules: each depth is held fixed between milestones and raised when training reaches the next one, and the two schedules need not share milestones.

During training, the model clamps the depth from below so that at least one warm-up step always occurs; during inference, this lower bound does not apply (Hakimi, 3 Mar 2026).
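A milestone-based schedule of this kind can be sketched as a piecewise-constant lookup. Everything concrete below is an assumption made for illustration: the milestone steps, the depth values, the helper names, and the lower bound of two outer steps during training (one warm-up step plus the supervised final step) are hypothetical and not taken from the paper.

```python
from bisect import bisect_right


def depth_at(step, milestones, depths):
    """Piecewise-constant schedule: depths[i] applies from milestones[i] onward."""
    if step < milestones[0]:
        raise ValueError("step precedes the first milestone")
    return depths[bisect_right(milestones, step) - 1]


# Hypothetical schedules, chosen only for illustration.
OUTER_MILESTONES, OUTER_DEPTHS = [0, 1000, 3000], [1, 4, 8]
INNER_MILESTONES, INNER_DEPTHS = [0, 2000], [1, 2]
MIN_TRAIN_OUTER = 2  # assumption: one warm-up step plus the supervised final step


def depths_for(step, training=True):
    """Look up outer and inner depth independently for a given training step."""
    outer = depth_at(step, OUTER_MILESTONES, OUTER_DEPTHS)
    inner = depth_at(step, INNER_MILESTONES, INNER_DEPTHS)
    if training:
        outer = max(MIN_TRAIN_OUTER, outer)  # the clamp applies in training only
    return outer, inner
```

Keeping the two schedules in separate tables is what makes the growth independent: the outer depth can increase at a step where the inner depth stays fixed, and vice versa. The `training=False` path is included only to show the clamp being lifted.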

To mitigate instability when increasing depth, the paper introduces a stochastic outer-transition scheme, described as stochastic depth over the outer depth $H$. At each training step, the penultimate outer transition is included in the gradient path at random. If the penultimate transition is included, gradients span two outer steps; otherwise, the model falls back to strict one-step outer credit assignment. For Sudoku, the paper reports a typical setting for this scheme.

Additional stabilizers include gradient clipping by norm, learning-rate warmup plus cosine decay, optional transition LR warmup, optional EMA, and optional optimizer-state scaling or reset when depth grows (Hakimi, 3 Mar 2026).
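The stochastic outer-transition scheme described above amounts to a random choice, per training step, of how many trailing outer steps stay in the gradient path. The sketch below illustrates that choice only; `gradient_span` and `include_prob` are hypothetical names, and the probability value used in the usage lines is arbitrary rather than the paper's setting.

```python
import random


def gradient_span(outer_depth, include_prob, rng):
    """Return the 0-based outer-step indices that remain in the gradient path."""
    last = outer_depth - 1
    # With probability include_prob, keep the penultimate transition as well.
    if outer_depth >= 2 and rng.random() < include_prob:
        return [last - 1, last]
    # Otherwise fall back to strict one-step outer credit assignment.
    return [last]


rng = random.Random(0)
spans = [gradient_span(8, 0.5, rng) for _ in range(1000)]
two_step_fraction = sum(len(s) == 2 for s in spans) / len(spans)
```

All outer steps before the returned indices would run as detached warm-up, so the gradient path never exceeds two outer steps regardless of `outer_depth`.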

4. Test-time scaling, settling dynamics, and reliability signal

A central claim of RSM is that it can be trained shallow and deployed deep. The reported training regime uses a small outer depth, yet the paper states that inference can be run for far more outer steps, so that $H_{\text{test}} \gg H_{\text{train}}$. The motivation is that the learned operator should remain useful after arbitrary prior refinement, because training did not couple success to a fixed supervised trajectory (Hakimi, 3 Mar 2026).

The paper interprets the resulting dynamics as an iterative settling process. Repeated application of the shared operator can move latent states toward a stable fixed point, although no theorem proving contraction or global convergence is given. The connection to fixed-point solvers, DEQs, Neural ODE-like iterative depth, and iterative denoising is explicitly conceptual rather than formal. The paper is clear that there are no convergence guarantees and that the model can still oscillate or settle incorrectly (Hakimi, 3 Mar 2026).

At inference time, the authors decode along the rollout,

$\hat y^{(t)} = \mathrm{Decode}\big(Z_H^{(t)}\big), \quad t = 1, 2, \dots,$

and use two practical diagnostics. The first is steps-to-solve, the first outer step $t^\star$ at which the decoded output satisfies the verifier and remains stable thereafter. The second is a fixed-point or settling check, namely whether decoded outputs stop changing across consecutive steps. The Sudoku visualization is described with the observation that once the model finds the solution, it often “stops changing it” (Hakimi, 3 Mar 2026).

This yields what the paper calls an architecture-native reliability signal. If the trajectory settles and the answer passes a verifier, confidence should be high; if it does not settle, that warns that the model has not reached a viable solution. The paper also stresses the obvious caveat: convergence is not correctness, since a model can converge to a wrong fixed point. Still, in verifier-rich domains, the conjunction of settled and passes verifier is presented as a practical certificate-like signal (Hakimi, 3 Mar 2026).
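The two diagnostics and their conjunction can be written as small helpers over a list of decoded outputs, one per outer step. This is a minimal sketch under stated assumptions: the function names, the `window` parameter, and the toy verifier are all hypothetical, and a real verifier would check the puzzle's actual constraints.

```python
def steps_to_solve(decoded, verifier):
    """First 1-based outer step whose output passes the verifier and never changes after."""
    for t, out in enumerate(decoded, start=1):
        if verifier(out) and all(later == out for later in decoded[t:]):
            return t
    return None


def has_settled(decoded, window=2):
    """True if the last `window` decoded outputs are identical."""
    tail = decoded[-window:]
    return len(tail) == window and all(out == tail[0] for out in tail)


def reliability(decoded, verifier):
    """Combine settling with verification; settling alone is not correctness."""
    if not has_settled(decoded):
        return "not settled"
    if verifier(decoded[-1]):
        return "settled and verified"
    return "settled on an unverified answer"


def toy_verifier(out):
    """Hypothetical check: the output must be a permutation of 1, 2, 3."""
    return sorted(out) == [1, 2, 3]


solved = [(1, 1, 3), (1, 2, 3), (1, 2, 3), (1, 2, 3)]
wrong_fixed_point = [(1, 1, 3), (1, 1, 3), (1, 1, 3)]
oscillating = [(1, 2, 3), (1, 1, 3), (1, 2, 3), (1, 1, 3)]
```

The `wrong_fixed_point` rollout is the caveat in concrete form: it settles, so the fixed-point check passes, but only the verifier reveals that the settled answer is invalid.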

5. Empirical profile

The reported experimental focus is on Sudoku-Extreme and Maze-Hard ($30 \times 30$), with model sizes of roughly 2.5M–5M parameters. Two headline claims are emphasized: substantially faster training than TRM and a several-fold reduction in error rate (Hakimi, 3 Mar 2026).

On Sudoku-Extreme, RSM reaches 97.5% exact accuracy with test-time compute, within roughly 1 hour of training on a single A100. The paper contrasts this with a previous TRM state of the art of about 87% on Sudoku with 12 hours of training. The reported scaling trend is especially important: the model is trained only to shallow depth, but increasing the number of outer cycles $H$ at test time substantially improves solve rate, and the figure description explicitly states that increasing $H$ does not cause a drop in accuracy. The solve-rate progression reported in the figure description includes values such as 67.7, 88.7, 96.8, and 97.5, which the text uses to support the conclusion that test-time recursion materially improves Sudoku performance (Hakimi, 3 Mar 2026).
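As a quick check on what these Sudoku figures imply, the ratios can be worked out directly. The calculation below uses only the rounded numbers quoted above and treats the approximate values as exact, so the results are indicative rather than precise.

```latex
\begin{aligned}
\text{error}_{\text{TRM}} &\approx 1 - 0.87 = 0.13, \\
\text{error}_{\text{RSM}} &= 1 - 0.975 = 0.025, \\
\frac{\text{error}_{\text{TRM}}}{\text{error}_{\text{RSM}}} &\approx \frac{0.13}{0.025} = 5.2, \\
\frac{\text{training time}_{\text{TRM}}}{\text{training time}_{\text{RSM}}} &\approx \frac{12\ \text{h}}{1\ \text{h}} = 12.
\end{aligned}
```

These ratios are derived here from the rounded figures in the text and need not match the headline factors stated in the paper, which may rest on different baselines or hardware.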

On Maze-Hard ($30 \times 30$), the attention-based variant reaches about 80% exact accuracy in roughly 40 minutes. The visual description suggests that the model often first finds a rough path and then refines it over subsequent recursion steps. The paper speculates that Maze performance may have been constrained more by training-data size and overfitting than by architectural limits, but this remains an interpretation rather than a controlled ablation (Hakimi, 3 Mar 2026).

The paper also reports qualitative evidence that many puzzles solve before the maximum test depth, which supports the claim that the model is genuinely settling rather than merely consuming extra compute. At the same time, the experimental section is described as lightweight and exploratory: the total experiment budget was only about \$50 on Google Colab, and the authors explicitly state that the work is not a comprehensive ablation study (Hakimi, 3 Mar 2026).

6. Relation to neighboring recursive models and acronym usage

RSM belongs to a rapidly developing literature on recursive latent computation, but it occupies a specific position within that landscape. A related formalism, Recursive Inference Machines, treats TRM as a special case with shared Solver and Generator backbones and identity reweighting, which helps clarify that RSM is best read as a refinement of the TRM line rather than a competing umbrella framework (Komisarczyk et al., 5 Mar 2026). Another neighboring direction, Generative Recursive reAsoning Models (GRAM), extends recursive latent reasoning into a stochastic latent-variable framework with multiple trajectories and variational training, thereby adding a probabilistic axis that RSM itself does not include (Baek et al., 19 May 2026). Recursive scaling has also been explored in masked diffusion, where repeated application of a shared denoising block is treated as a third scaling axis beyond parameter count and denoising steps (Carballo-Castro et al., 16 Jun 2026). More general recursive long-horizon reasoning systems have been formulated in terms of explicit call/return scaffolds over recursive subtasks, emphasizing bounded active context rather than verifier-rich latent settling (Yang et al., 2 Mar 2026).

A recurring source of confusion is the acronym RSM itself. In other arXiv contexts it refers to Reverse Sequence Mutation in genetic algorithms for TSP (Abdoun et al., 2012), Reusable Slotwise Mechanisms in object-centric world modeling (Nguyen et al., 2023), Regime-Switching Model for exoplanet detection in high-contrast imaging (Dahlqvist et al., 2020), and Recursive State Machines in program analysis (Chatterjee et al., 2017). None of those is the same construct as Recursive Stem Model.

Within recursive reasoning specifically, the distinctive signature of RSM is therefore narrow but clear: a small, heavily weight-shared recursive backbone; two latent states $Z_H$ and $Z_L$; full hidden-state detachment over history during training; detached warm-up; terminal-only supervision; independent growth of outer depth $H$ and inner depth $L$; and explicit use of settling behavior as a practical reliability signal (Hakimi, 3 Mar 2026). This suggests that RSM’s enduring significance, if the empirical claims generalize, will lie in reframing recursive reasoning as stable operator learning rather than deeply supervised trajectory learning.