
Generative Recursive Reasoning Models (GRAM)

Updated 2 July 2026
  • GRAM is a family of probabilistic, multi-trajectory neural architectures that use recursive latent-state refinement for both conditional reasoning and unconditional generation.
  • It employs a hierarchical process with stochastic high-level transitions and deterministic low-level steps to outperform baseline models on tasks like Sudoku and graph coloring.
  • Key design principles include loop detection, mandatory grounding, and recursive self-training for reward modeling, addressing failures like epistemic stasis.

Generative Recursive reAsoning Models (GRAM) are a family of probabilistic, multi-trajectory neural architectures for extended computation and structured generation. The GRAM framework introduces stochastic, recursive latent-state refinement as a generative process, supporting both conditional reasoning and unconditional data generation. Recent advances in this domain demonstrate that recursive generative reasoning models outperform deterministic and purely autoregressive baselines on complex tasks that require multiple, alternative solution trajectories, integrated uncertainty, and explicit epistemic rationales (Baek et al., 19 May 2026, Wang et al., 2 Sep 2025, DeVilling, 23 Oct 2025).

1. Formal Definition and Recursive Process

GRAM implements reasoning as a latent-variable generative process over a recursive trajectory

$$\tau = (z_0 \to z_1 \to \ldots \to z_{T_{\text{total}}})$$

with $T_{\text{total}}$ the total number of latent transition steps. The process is factorized as:

  • Initial prior: $z_0 \sim p(z_0)$, typically a fixed learned constant or a sample from $N(0, I)$.
  • Recursive transitions: for $t = 1, \ldots, T_{\text{total}}$,

$$z_t \sim p_\theta(z_t \mid z_{t-1}, e_x)$$

where $e_x = f_\text{enc}(x; \theta)$ is a (potentially empty) encoding of the input $x$.

A hierarchical structure is adopted, decomposing each $z_t = (h_t, l_t)$ into a high-level state ($h \in \mathbb{R}^D$) and a low-level state ($l$), updated via a sequence of deterministic low-level steps followed by a stochastic high-level transition:

  • Low-level: the low-level state is refined deterministically over several inner steps, and the result of the last inner step is carried forward.
  • High-level: the high-level state is then resampled through a stochastic transition.

For conditional reasoning, the model produces its output distribution by marginalizing over the latent trajectory $\tau$, using a decoder applied to the latent state. For unconditional generation, the input $x$ is empty and the latent state is decoded directly (Baek et al., 19 May 2026).
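The control flow of the hierarchical process described above can be sketched in a few lines of plain Python. This is only an illustration of the loop structure (deterministic inner steps, then one stochastic outer transition). The function names, the `tanh` updates, the scalar weights, and the fixed noise scale are all hypothetical placeholders, not the update rules of Baek et al.

```python
import math
import random


def low_level_step(l, h, e_x):
    # Deterministic refinement of the low-level state (toy elementwise update).
    return [math.tanh(0.5 * li + 0.3 * hi + 0.2 * ei)
            for li, hi, ei in zip(l, h, e_x)]


def high_level_transition(h, l, rng):
    # Stochastic transition: Gaussian sample around a deterministic proposal.
    # The fixed scale is an assumption made for brevity; it only illustrates
    # where randomness enters the recursion.
    mean = [math.tanh(0.6 * hi + 0.4 * li) for hi, li in zip(h, l)]
    std = 0.1
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


def sample_trajectory(e_x, n_outer=4, n_inner=3, seed=0):
    """Sample one latent trajectory [(h_0, l_0), ..., (h_T, l_T)]."""
    rng = random.Random(seed)
    dim = len(e_x)
    h, l = [0.0] * dim, [0.0] * dim  # z_0 taken as a fixed constant
    trajectory = [(h, l)]
    for _ in range(n_outer):
        for _ in range(n_inner):
            l = low_level_step(l, h, e_x)
        h = high_level_transition(h, l, rng)
        trajectory.append((h, l))
    return trajectory


if __name__ == "__main__":
    final_h, final_l = sample_trajectory([1.0, -1.0, 0.5], seed=42)[-1]
    print(final_h)
```

Different seeds give different trajectories for the same input, which is what makes multi-trajectory sampling possible. Passing an all-zero `e_x` mimics the unconditional case in this toy setting.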

2. Inference and Training

Inference uses amortized variational techniques. The variational posterior shares structure with the prior but additionally conditions on the target. Training optimizes the evidence lower bound (ELBO), whose KL divergence term is decomposed across time steps. Deep supervision is applied at fixed step intervals, and gradients are truncated across supervision boundaries for computational efficiency.
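For orientation, a generic ELBO for a Markov latent trajectory of this kind is written out below. It is the standard bound for sequential latent-variable models and is not necessarily the exact objective used by Baek et al. The target symbol $y$, the posterior notation $q_\phi$, the decoder $p_\theta(y \mid z_{T_{\text{total}}})$, and a posterior that shares the prior over $z_0$ are assumptions made for this sketch.

```latex
\log p_\theta(y \mid x) \;\ge\;
\mathbb{E}_{q_\phi(\tau \mid x, y)}
  \left[ \log p_\theta\!\left(y \mid z_{T_{\text{total}}}\right) \right]
\;-\; \sum_{t=1}^{T_{\text{total}}}
\mathbb{E}_{q_\phi(z_{t-1} \mid x, y)}
  \left[ \mathrm{KL}\!\left(
    q_\phi(z_t \mid z_{t-1}, e_x, y) \,\big\|\, p_\theta(z_t \mid z_{t-1}, e_x)
  \right) \right]
```

The sum over $t$ is the sense in which the KL term decomposes across time: because both prior and posterior are Markov in $z_t$, the trajectory-level divergence splits into one conditional term per transition.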

Additional architectural components include:

Inference-time scaling is achieved by increasing recursion depth, using adaptive computation time (ACT) halting, or sampling multiple stochastic trajectories in parallel, with outputs selected by majority vote or LPRM value (Baek et al., 19 May 2026).
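The majority-vote variant of parallel sampling can be illustrated as follows. This is a minimal sketch: `sample_fn` is a hypothetical stand-in for one stochastic rollout plus decoding, and selection by LPRM value is not shown.

```python
import random
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent candidate (candidates must be hashable)."""
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner


def solve_with_voting(sample_fn, x, n_samples=16, seed=0):
    """Draw n_samples stochastic answers for input x and vote among them.

    sample_fn(x, rng) is assumed to run one stochastic trajectory and
    return a decoded answer, e.g. a tuple of cell values.
    """
    rng = random.Random(seed)
    candidates = [sample_fn(x, rng) for _ in range(n_samples)]
    return majority_vote(candidates)


if __name__ == "__main__":
    # Toy sampler: returns the right answer most of the time.
    def noisy_sampler(x, rng):
        return tuple(x) if rng.random() < 0.7 else tuple(reversed(x))

    print(solve_with_voting(noisy_sampler, [1, 2, 3], n_samples=31))
```

Increasing `n_samples` trades compute for reliability, which is one of the scaling axes described above.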

3. Experimental Results and Empirical Performance

GRAM demonstrates state-of-the-art results on structured reasoning and constraint satisfaction tasks:

  • Sudoku-Extreme: GRAM achieves 97.0% accuracy, vs. 87.4% for TRM and 55.0% for HRM.
  • ARC-AGI-1: 52.0% (GRAM) vs. 44.6% (TRM).
  • N-Queens (8×8): GRAM 99.7% (single-sample), 90.3% coverage (20 samples), outperforming AR Transformer, MDLM, and TRM.
  • Graph Coloring: GRAM resolves 8-node graphs with 2.7 conflicts, outperforming AR’s 19.0 conflicts; 51.3% coverage on 10-node graphs.

In unconditional generation:

  • Sudoku puzzle generation: GRAM (10.9 M params, 16 steps) produces 99.05% valid, near-unique puzzles, significantly exceeding D3PM-large at 91.33%.
  • Binarized MNIST: GRAM achieves an Inception Score of 1.99 (vs. VAE’s 1.70) and FID of 74.30.

Distinctive findings include monotonic improvement in reasoning/generation quality with increasing recursion depth and parallel sample count, and ablation showing that learned stochastic guidance is necessary (naive noise injection fails) (Baek et al., 19 May 2026).

4. Connections to Recursive Generative Reasoning and the Mirror Loop

Recursive generative reasoning defines a discrete dynamical system over state-sequence trajectories, in which each state is produced from the previous state together with any external evidence. The ungrounded mirror loop corresponds to the case with no external evidence, where reflection occurs solely via internal reformulation. Experimental analysis reveals a rapid decline of mean informational change (measured as normalized edit distance) during recursive steps, converging to an “attractor of epistemic stasis”. This regime is marked by:

  • Decline in metrics such as normalized edit distance, embedding drift, n-gram novelty, and character-level entropy.
  • Empirical saturation across varied LLMs, with reductions in informational change reported for GPT-4o-mini, Claude 3 Haiku, and Gemini 2.0 Flash.
  • Even minimal grounding (e.g., an external verification at step 3) produces a rebound in informational change and restores sustained variance (DeVilling, 23 Oct 2025).

This suggests that unguided recursion induces a regime where generative reasoning becomes performative rather than epistemically informative, with progress reliant on deliberate injection of external evidence.
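Informational change between successive reflection steps can be measured with a normalized edit distance. The sketch below uses character-level Levenshtein distance divided by the longer string's length. That normalization is an assumption; the source does not specify which one DeVilling uses.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def informational_change(outputs):
    """Normalized edit distance between each pair of consecutive outputs."""
    return [normalized_edit_distance(prev, curr)
            for prev, curr in zip(outputs, outputs[1:])]


if __name__ == "__main__":
    steps = ["the answer is 4", "the answer is 4.", "the answer is 4."]
    print(informational_change(steps))
```

A sequence of values that shrinks toward zero over iterations is the signature of the stasis regime described above, while a jump after a grounding step corresponds to the reported rebound.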

5. Design Principles, Failure Modes, and Interventions

Design recommendations for effective GRAM systems, derived from empirical and theoretical analysis, include:

  • Loop Detection: Monitor metrics such as embedding drift and n-gram novelty over a rolling window, flagging when both drop below calibrated thresholds.
  • Mandatory Grounding: Incorporate external verification (retrieval, code execution, oracle check) at a fixed maximum interval of iterations, with the interval chosen empirically.
  • State Forking: Upon loop detection, spawn multiple continuations using varied sampling or prompt augmentation, selecting the most divergent.
  • Meta-Loss Penalties: Apply explicit penalties in training (or RLHF) to consecutive hidden states with high cosine similarity, discouraging fixed-point convergence.
  • Dissipative Inference Architectures: Structure the reasoning pipeline so every reflective pass is coupled with external informational influx, ensuring ongoing epistemic acquisition rather than recursive closure.
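The first two recommendations in the list above (loop detection and mandatory grounding) can be sketched as a small monitor. All names, the whitespace tokenization, the cosine-distance definition of drift, and the threshold values are illustrative assumptions. In practice the thresholds and the grounding interval would need calibration.

```python
import math
from collections import deque


def ngram_set(text, n=3):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_novelty(current, previous, n=3):
    """Fraction of n-grams in `current` that do not occur in `previous`."""
    cur = ngram_set(current, n)
    if not cur:
        return 0.0
    return len(cur - ngram_set(previous, n)) / len(cur)


def cosine_distance(u, v):
    """1 - cosine similarity; used here as a simple embedding-drift measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 0.0


class LoopDetector:
    """Flags a loop when rolling means of drift AND novelty are both low."""

    def __init__(self, window=3, drift_min=0.05, novelty_min=0.10):
        self.drift = deque(maxlen=window)
        self.novelty = deque(maxlen=window)
        self.drift_min = drift_min      # placeholder threshold
        self.novelty_min = novelty_min  # placeholder threshold

    def update(self, drift, novelty):
        self.drift.append(drift)
        self.novelty.append(novelty)
        if len(self.drift) < self.drift.maxlen:
            return False  # not enough history yet
        mean_drift = sum(self.drift) / len(self.drift)
        mean_novelty = sum(self.novelty) / len(self.novelty)
        return mean_drift < self.drift_min and mean_novelty < self.novelty_min


def needs_grounding(step, looped, interval):
    """Ground on a fixed schedule, or immediately when a loop is flagged."""
    return looped or step % interval == 0
```

A reasoning loop would call `update` once per iteration with the drift and novelty of the newest output, then route the next step through an external check whenever `needs_grounding` returns true.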

These principles are essential to convert syntactic recursion into genuine epistemic reflection and prevent the collapse into the mirror loop state (DeVilling, 23 Oct 2025).
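One possible form of the meta-loss penalty on consecutive hidden states is a hinge on cosine similarity, sketched here in plain Python. The hinge shape, the `margin` value, and the averaging over steps are assumptions chosen for illustration. The source only states that high cosine similarity between consecutive states should be penalized.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def stasis_penalty(hidden_states, margin=0.9):
    """Mean hinge penalty on consecutive states more similar than `margin`.

    hidden_states: list of equal-length float vectors, one per recursion step.
    Returns 0.0 when no consecutive pair exceeds the margin.
    """
    pairs = list(zip(hidden_states, hidden_states[1:]))
    if not pairs:
        return 0.0
    total = sum(max(0.0, cosine_similarity(a, b) - margin) for a, b in pairs)
    return total / len(pairs)
```

In a training setup this scalar would be added to the main loss with a small weight, so that trajectories whose states stop moving are discouraged without forcing change at every step.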

6. GRAM as Foundation and Reward Models

The GRAM paradigm extends to reward modeling, as illustrated by GRAM-R$^2$, a generative foundation reward model. GRAM-R$^2$ produces both preference labels and explicit, natural-language reward rationales. The training regime leverages a recursive self-training loop:

  • A “preference-proving” module synthesizes rationales for labeled pairs.
  • GRAM-R$^2$ is then iteratively self-trained on unlabeled data by generating pseudo-labels and rationales, filtering low-confidence or generic outputs using a Bayesian-justified scoring mechanism.

Empirical results on RM-Bench, JudgeBench, and coding/math response ranking show that GRAM-R$^2$ achieves 85.7% accuracy (LLaMA-3.1-8B backbone), outperforming discriminative (76.0%) and generative no-rationale (79.2%) baselines. Downstream, GRAM-R$^2$ accelerates RLHF adaptation and achieves higher win rates in human feedback–driven tuning (Wang et al., 2 Sep 2025).
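The shape of one self-training round can be sketched as below. The `predict` and `fit` callables, the record fields, the confidence threshold, and the list of "generic" rationales are hypothetical. In particular, a plain confidence cutoff stands in for the Bayesian-justified scoring mechanism, which the source does not spell out.

```python
# Rationales treated as uninformative in this toy filter (an assumption).
GENERIC_RATIONALES = {"", "response a is better.", "response b is better."}


def filter_pseudo_labels(predictions, min_confidence=0.8):
    """Keep predictions that are confident and carry a non-generic rationale."""
    kept = []
    for pred in predictions:
        generic = pred["rationale"].strip().lower() in GENERIC_RATIONALES
        if pred["confidence"] >= min_confidence and not generic:
            kept.append(pred)
    return kept


def self_training_round(predict, fit, labeled, unlabeled, min_confidence=0.8):
    """One round: pseudo-label the unlabeled pairs, filter, then retrain.

    predict(pair) -> dict with "label", "rationale", "confidence" keys.
    fit(examples) -> updates the model in place on labeled + pseudo-labeled data.
    Returns the pseudo-labeled examples that survived filtering.
    """
    predictions = [predict(pair) for pair in unlabeled]
    pseudo = filter_pseudo_labels(predictions, min_confidence)
    fit(labeled + pseudo)
    return pseudo
```

Repeating `self_training_round` with the updated model gives the iterative loop. Whether such a loop converges is left open by the source.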

A plausible implication is that by folding explicit reasoning and rationale generation directly into the reward modeling process, GRAMs can serve as generalist reward models requiring significantly less human-labeled preference data.

7. Limitations and Directions for Future Research

Current GRAM training requires deep supervision with truncated BPTT, which is less efficient (both compute and memory) than standard feedforward Transformer architectures. Scaling to very large backbones and to open-domain or multi-turn reasoning remains challenging, with open questions on variational posterior calibration and reward model selection. The recursive self-training procedure for reward models, while empirically effective, lacks formal convergence analysis.

Potential future avenues include:

  • Efficient scalable approximations for training and inference.
  • Theorizing the properties and convergence behavior of recursive self-training procedures.
  • Integration of richer reasoning structures, e.g., multi-step RL objectives and contrastive filtering.
  • Extending GRAMs to domains such as multi-turn dialogue, real-world planning, and unsupervised structured data generation (Baek et al., 19 May 2026, Wang et al., 2 Sep 2025, DeVilling, 23 Oct 2025).
