Generative Recursive Reasoning Models (GRAM)

Updated 22 May 2026

GRAM is a neural architecture that formulates reasoning as a probabilistic multi-trajectory process, enabling iterative refinement of latent solutions.
It integrates stochastic noise injection and deep supervision, supporting both conditional problem-to-solution mapping and generative reasoning.
Empirical results on tasks like Sudoku and N-Queens demonstrate GRAMs' superior accuracy, scalability, and interpretability compared to deterministic models.

Generative Recursive Reasoning Models (GRAM) constitute a class of neural architectures that model reasoning as probabilistic, multi-trajectory recursive computation, capable of producing multiple hypotheses and refining latent solutions over a sequence of internal steps. Going beyond deterministic recursive reasoning models (RRMs), which follow a fixed latent trajectory toward a single convergent prediction, GRAMs implement stochastic latent dynamics—incorporating alternative strategies and supporting scalable inference by depth (recursion steps) and width (trajectory sampling). These frameworks unify symbolic, structural, and probabilistic paradigms, enabling both conditional (e.g., problem-to-solution) and unconditional (generative) reasoning, and underlie a range of applications including structured prediction, constraint satisfaction, reward modeling, and autonomous scientific discovery (Baek et al., 19 May 2026, Buehler, 14 Jan 2025, Wang et al., 2 Sep 2025, DeVilling, 23 Oct 2025).

1. Foundations and Model Formulation

At the core of GRAMs is the conceptualization of reasoning as stochastic trajectory modeling in a latent space. Formally, let $x$ denote the input (problem statement), $y$ the output (solution or answer), and $\tau = (z_0 \rightarrow z_1 \rightarrow \cdots \rightarrow z_T)$ a sequence of latent reasoning states. The joint distribution is defined as

$p_\theta(y, \tau \mid x) = p(z_0) \cdot \prod_{t=1}^T p_\theta(z_t \mid z_{t-1}, e_x) \cdot p_\theta(y \mid z_T, x)$

where $e_x = f_\mathrm{enc}(x;\theta)$ encodes the input, and $z_0$ is the initial latent state. Each stochastic transition first computes a deterministic proposal $u_t = f_\theta(z_{t-1}, e_x)$ and then injects Gaussian noise:

$\varepsilon_t \sim \mathcal{N}(\mu_\theta(u_t), \sigma^2_\theta(u_t) I), \qquad z_t = u_t + \varepsilon_t$

Because the noise is resampled on every run, the same input can produce multiple distinct reasoning paths. For unconditional generation, $x$ is omitted or replaced with a null embedding, yielding $p_\theta(y)$ as the marginal over latent trajectories (Baek et al., 19 May 2026).
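The transition described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `propose`, `noise_mean`, and `noise_std` are hypothetical stand-ins for $f_\theta$, $\mu_\theta$, and $\sigma_\theta$, the latent state is a list of floats, and $p(z_0)$ is assumed to be a standard normal.

```python
import math
import random

def propose(z_prev, e_x):
    """Hypothetical stand-in for f_theta: deterministic proposal u_t."""
    return [math.tanh(0.9 * z + 0.5 * e) for z, e in zip(z_prev, e_x)]

def noise_mean(u):
    """Hypothetical stand-in for mu_theta(u_t); zero-mean noise in this toy."""
    return [0.0 for _ in u]

def noise_std(u):
    """Hypothetical stand-in for sigma_theta(u_t): one scalar shared by all
    dimensions, matching the isotropic covariance sigma^2 * I."""
    return 0.1 / (1.0 + sum(v * v for v in u))

def sample_trajectory(e_x, steps, rng):
    """Sample z_0 -> z_1 -> ... -> z_T with z_t = u_t + eps_t."""
    z = [rng.gauss(0.0, 1.0) for _ in e_x]  # z_0 ~ p(z_0), assumed N(0, I)
    trajectory = [z]
    for _ in range(steps):
        u = propose(z, e_x)
        mu, sd = noise_mean(u), noise_std(u)
        z = [u_i + rng.gauss(m_i, sd) for u_i, m_i in zip(u, mu)]
        trajectory.append(z)
    return trajectory

# Two seeds give two different reasoning paths for the same encoded input.
e_x = [0.2, -0.4, 1.0]
path_a = sample_trajectory(e_x, steps=5, rng=random.Random(0))
path_b = sample_trajectory(e_x, steps=5, rng=random.Random(1))
```

Passing the random generator in explicitly keeps each trajectory reproducible, which is convenient when many of them are sampled side by side.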

GRAMs generalize earlier deterministic RRMs (e.g., Looped Transformer, Hierarchical Recurrent Models) by maintaining a distribution over trajectories, rather than collapsing to a single fixed solution. This approach supports both inference-time “width” scaling (sampling many trajectories in parallel) and “depth” scaling (extending the recursion length).

2. Training Objectives and Inference Procedures

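Direct optimization of $\log p_\theta(y \mid x)$ is intractable because it requires marginalizing over the latent trajectory $\tau$, so GRAMs employ amortized variational inference. An inference network $q_\phi(\tau \mid x, y)$, with analogous Markov structure, samples target-informed stochastic transitions:

$z_t \sim q_\phi(z_t \mid z_{t-1}, e_x, y)$

Training proceeds by maximizing the evidence lower bound (ELBO):

$\log p_\theta(y \mid x) \ge \mathbb{E}_{q_\phi(\tau \mid x, y)}\left[\log p_\theta(y, \tau \mid x) - \log q_\phi(\tau \mid x, y)\right]$

The memory cost of long recursions is addressed by truncated backpropagation and deep supervision, which optimizes surrogate losses at each intermediate “supervision step” (Baek et al., 19 May 2026).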
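For reference, the variational bound can be written out term by term. The derivation below is a sketch that relies on one assumption the text does not spell out: the inference network, written here as $q_\phi$, shares the prior $p(z_0)$ and factorizes as $q_\phi(\tau \mid x, y) = p(z_0) \prod_{t=1}^T q_\phi(z_t \mid z_{t-1}, e_x, y)$. Combined with the joint distribution from Section 1, Jensen's inequality then gives a reconstruction term plus one KL term per recursion step:

```latex
\begin{aligned}
\log p_\theta(y \mid x)
  &= \log \mathbb{E}_{q_\phi(\tau \mid x, y)}
     \left[ \frac{p_\theta(y, \tau \mid x)}{q_\phi(\tau \mid x, y)} \right] \\
  &\ge \mathbb{E}_{q_\phi(\tau \mid x, y)}
     \left[ \log p_\theta(y \mid z_T, x) \right]
   - \sum_{t=1}^{T} \mathbb{E}_{q_\phi(\tau \mid x, y)}
     \left[ \mathrm{KL}\!\left( q_\phi(z_t \mid z_{t-1}, e_x, y)
       \,\middle\|\, p_\theta(z_t \mid z_{t-1}, e_x) \right) \right]
\end{aligned}
```

The $p(z_0)$ factors cancel in the ratio, which is why no KL term appears for the initial state under this assumption.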

At inference, GRAM samples $N$ trajectories $\{\tau^{(n)}\}_{n=1}^{N}$, yielding corresponding outputs $\{y^{(n)}\}_{n=1}^{N}$, from which the final answer is selected by majority voting or by scoring with a learned Latent Process Reward Model (LPRM) (Baek et al., 19 May 2026). This design admits scalable, parallel sampling, which expands solution coverage and improves robustness.
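The two selection rules mentioned above can be illustrated with a short sketch. It assumes that outputs are hashable (tuples here) and that each sample is a `(trajectory, output)` pair; `score_fn` is a hypothetical stand-in for a learned trajectory scorer such as the LPRM, whose actual interface is not described in this text.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the most frequent output; ties go to the one seen first."""
    return Counter(outputs).most_common(1)[0][0]

def select_by_score(samples, score_fn):
    """Return the output whose trajectory receives the highest score.

    `samples` is a list of (trajectory, output) pairs; `score_fn` is a
    hypothetical scorer mapping a trajectory to a float.
    """
    _, best_output = max(samples, key=lambda pair: score_fn(pair[0]))
    return best_output

# Toy example: three sampled trajectories, two of which agree on the answer.
samples = [
    ([0.1, 0.4, 0.9], (1, 2, 3)),
    ([0.2, 0.3, 0.5], (1, 2, 4)),
    ([0.0, 0.5, 0.8], (1, 2, 3)),
]
voted = majority_vote([output for _, output in samples])
scored = select_by_score(samples, score_fn=lambda traj: traj[-1])
```

Majority voting needs no extra model but only helps when several trajectories agree; a learned scorer can pick a correct answer that appears just once.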

3. Algorithmic and Structural Variations

GRAMs support hierarchical recursive architectures in which a low-level network refines a fine-grained state over several inner steps, and a high-level network updates an abstract state on each outer step. Stochasticity is usually injected at the high level, shaping overall reasoning paths without disrupting local refinements (Baek et al., 19 May 2026).
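The inner/outer schedule described above can be sketched as a nested loop. Everything named here is illustrative: `low_level_step` and `high_level_step` are hypothetical toy update rules, the zero initial states are an assumption, and the noise is plain zero-mean Gaussian noise added only to the high-level state.

```python
import math
import random

def low_level_step(z_low, z_high, e_x):
    """Hypothetical low-level update: deterministic local refinement."""
    return [math.tanh(0.5 * l + 0.3 * h + 0.2 * e)
            for l, h, e in zip(z_low, z_high, e_x)]

def high_level_step(z_high, z_low):
    """Hypothetical high-level update: summarizes the refined low-level state."""
    return [math.tanh(0.7 * h + 0.3 * l) for h, l in zip(z_high, z_low)]

def hierarchical_rollout(e_x, outer_steps, inner_steps, noise_std, rng):
    """Run `inner_steps` low-level updates per outer step, then one noisy
    high-level update. Noise enters only at the high level."""
    z_low = [0.0] * len(e_x)   # assumed initial states
    z_high = [0.0] * len(e_x)
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            z_low = low_level_step(z_low, z_high, e_x)
        proposal = high_level_step(z_high, z_low)
        z_high = [p + rng.gauss(0.0, noise_std) for p in proposal]
    return z_low, z_high

e_x = [0.5, -0.5]
run_a = hierarchical_rollout(e_x, 3, 4, 0.2, random.Random(0))
run_b = hierarchical_rollout(e_x, 3, 4, 0.2, random.Random(1))
```

Setting `noise_std` to zero recovers a deterministic rollout, which makes the contrast with a fixed-trajectory recursive model easy to check.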

The Graph-PReFLexOR framework integrates explicit graph-based and category-theoretic reasoning into a GRAM structure, mapping tasks to knowledge graphs and abstract pattern sequences, which are recursively refined through paired generative and critic agents (Buehler, 14 Jan 2025). The system formalizes recursive refinement as an iterated update of the current reasoning state, with knowledge expansion realized as iterative graph augmentation.

Self-training generative models for reward reasoning, such as GRAM-R$^2$, are designed to produce both preference predictions and reward rationales, functioning as foundation reward models applicable across diverse tasks, and supporting reinforcement learning from human feedback (RLHF) (Wang et al., 2 Sep 2025).

4. Analysis of Recursive Dynamics and Mirror Loops

Purely ungrounded recursive self-critique, meaning the generation of successive answers without external evidence, tends toward informational closure, quantified by a sharp decline in mean normalized information change ($\Delta I$) per iteration. Empirical evidence reveals a 55% reduction in $\Delta I$ between early and late recursion phases, with convergence to an attractor (“mirror loop”) where no further progress is made. Complementary measures, including embedding drift, 3-gram novelty, and token entropy, corroborate this pattern, which the cited work terms recursive non-convergence (DeVilling, 23 Oct 2025).
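To make the early-versus-late comparison concrete, the sketch below computes a per-iteration change score between successive answers and the relative drop between the two halves of a run. The exact metric used in the cited work is not given in this text, so `info_change` uses the dissimilarity from Python's `difflib.SequenceMatcher` as an assumed proxy, and the input numbers are made up for illustration.

```python
from difflib import SequenceMatcher

def info_change(prev_answer, curr_answer):
    """Assumed proxy for per-step change: 1 - similarity ratio, in [0, 1].
    Identical answers give 0.0; answers with nothing in common give 1.0."""
    return 1.0 - SequenceMatcher(None, prev_answer, curr_answer).ratio()

def change_series(answers):
    """Change score between each pair of consecutive answers."""
    return [info_change(a, b) for a, b in zip(answers, answers[1:])]

def early_late_drop(deltas):
    """Relative drop of the mean change from the first half of the run to
    the second half. Requires at least two values and a nonzero early mean."""
    half = len(deltas) // 2
    early = sum(deltas[:half]) / half
    late = sum(deltas[half:]) / (len(deltas) - half)
    return 1.0 - late / early

# Illustrative values only: a run whose answers stop changing over time.
answers = ["the sum is 5", "the sum is 4", "the sum is 4", "the sum is 4"]
deltas = change_series(answers)
```

A drop close to 1.0 means the late iterations barely differ from one another, which is the signature of the stalled loop described above.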

Minimal grounding interventions, such as a single external verification step, restore information flux and disrupt fixed-point convergence, observed as a +28% rebound in the information-change measure $\Delta I$ immediately post-intervention. These findings motivate essential design principles for GRAMs:

Online loop detection (mean embedding and $n$-gram novelty thresholds),
Mandatory periodic grounding (retrieval, code execution, oracle queries),
State forking (branching continuations upon loop detection),
Meta-loss penalties on state similarity during training (DeVilling, 23 Oct 2025).
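The first principle in the list above, online loop detection, can be sketched with an $n$-gram novelty check. This is an illustrative detector under assumed settings: whitespace tokenization, and the `threshold` and `patience` values, are hypothetical choices rather than values taken from the cited work. A caller could respond to a detected loop by triggering a grounding step or forking the state.

```python
def ngrams(tokens, n):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(history, current, n=3):
    """Fraction of n-grams in `current` that appear in no earlier output.
    Returns 0.0 when `current` is too short to contain any n-gram."""
    seen = set()
    for past in history:
        seen |= ngrams(past.split(), n)
    current_grams = ngrams(current.split(), n)
    if not current_grams:
        return 0.0
    return len(current_grams - seen) / len(current_grams)

def detect_loop(outputs, n=3, threshold=0.1, patience=2):
    """True once novelty stays below `threshold` for `patience`
    consecutive iterations (both values are assumed defaults)."""
    low_streak = 0
    for i in range(1, len(outputs)):
        if ngram_novelty(outputs[:i], outputs[i], n) < threshold:
            low_streak += 1
            if low_streak >= patience:
                return True
        else:
            low_streak = 0
    return False

stuck = ["the answer is four because two plus two"] * 3
moving = ["a b c d", "e f g h", "i j k l"]
```

Requiring several consecutive low-novelty steps avoids firing on a single repeated sentence, at the cost of detecting a loop a step or two later.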

5. Empirical Performance and Evaluation

On structured reasoning and combinatorial search tasks, GRAMs outperform deterministic RRMs and autoregressive models in both accuracy and solution coverage. For instance, on Sudoku-Extreme (9×9), GRAM achieves 97.0% accuracy (N=20, 16 steps), versus 87.4% for the best deterministic RRM (TRM), while on 8×8 N-Queens, GRAM delivers 90.3% coverage and 99.7% solution accuracy, far exceeding deterministic alternatives (Baek et al., 19 May 2026). On unconditional generation benchmarks such as binarized MNIST, recursive depth scaling correlates directly with improved Inception Score and FID.
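For the N-Queens figures, it helps to see how coverage and solution accuracy could be computed from a batch of sampled boards. The definitions used here are assumptions, since the text does not define them: accuracy is taken as the fraction of samples that are valid placements, and coverage as the fraction of all distinct solutions that appear at least once among the samples. Boards are encoded as one column index per row.

```python
from itertools import permutations

def is_valid_queens(cols):
    """cols[r] is the column of the queen in row r. Valid when every column
    is used exactly once and no two queens share a diagonal."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False
    return all(abs(cols[i] - cols[j]) != j - i
               for i in range(n) for j in range(i + 1, n))

def all_solutions(n):
    """Brute-force enumeration of every valid n-queens placement."""
    return {p for p in permutations(range(n)) if is_valid_queens(p)}

def coverage_and_accuracy(samples, n):
    """Assumed definitions: accuracy = valid samples / all samples;
    coverage = distinct valid samples / total number of solutions.
    `samples` must be non-empty."""
    valid = [tuple(s) for s in samples if is_valid_queens(s)]
    accuracy = len(valid) / len(samples)
    coverage = len(set(valid)) / len(all_solutions(n))
    return coverage, accuracy

# Toy batch on a 4x4 board: one solution sampled twice, plus one invalid board.
batch = [(1, 3, 0, 2), (1, 3, 0, 2), (0, 1, 2, 3)]
coverage, accuracy = coverage_and_accuracy(batch, n=4)
```

The two numbers measure different things: a sampler that always returns the same correct board has perfect accuracy but minimal coverage, which is why multi-trajectory sampling matters for this task.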

Graph-PReFLexOR, evaluated on reasoning depth (GPT-4 scoring) and cross-domain scientific discovery, demonstrates both superior logical coherence (9.7/10 vs. 6.4/10 baseline) and the ability to generate explicit, interpretable abstractions (patterns and knowledge graphs), supporting autonomous hypothesis generation and interdisciplinary knowledge expansion (Buehler, 14 Jan 2025).

GRAM-R$^2$ demonstrates strong performance as a generalist reward model, supporting downstream applications such as response ranking and task-specific reward adaptation while producing explicit reward rationales. It outperforms both discriminative and generative baselines on the reported evaluation benchmarks (Wang et al., 2 Sep 2025).

6. Applications, Generalizations, and Limitations

GRAMs have been deployed for symbolic structure generation, constraint satisfaction (e.g., Sudoku, N-Queens, Graph Coloring), visual reasoning (ARC-AGI), foundation reward modeling, and open-ended scientific discovery. The “knowledge garden growth” strategy exemplifies using GRAMs to autonomously expand domain knowledge through recursive graph building and cross-domain prompting (Buehler, 14 Jan 2025).

Ablation studies establish that both stochasticity and architectural guidance are critical—removal of noise injection or use of naive sampling causes collapse or no improvement over deterministic baselines. Width scaling (multi-trajectory sampling) and deep supervision are generic, transferable enhancements to recursive cores (Baek et al., 19 May 2026).

Known limitations include training cost (sequential deep supervision, truncated backpropagation), which may constrain scalability, and the open problem of designing richer variational families or more efficient surrogates for complex latent trajectory spaces. The epistemic stasis induced by pure self-critique remains a unique challenge, demanding hybrid architectures coupling recursive inference with external grounding (DeVilling, 23 Oct 2025).

7. Outlook and Theoretical Implications

By recasting recursion as a stochastic generative process over latent states, GRAMs adopt a probabilistic, multi-solution reasoning paradigm that natively quantifies uncertainty, explores diverse hypotheses, and supports scalable, parallelizable inference. Theoretical synthesis with graph-based reasoning and category theory (graph isomorphisms, functorial mappings) in frameworks like Graph-PReFLexOR opens pathways to interpretable, interdisciplinary, and autonomous reasoning systems (Buehler, 14 Jan 2025).

A plausible implication is that GRAMs will underpin future reasoning-first foundation models for science, invention, and knowledge engineering, in which open-ended hypothesis generation, structured knowledge expansion, and grounded epistemic reflection are all accessible within a unified, scalable architecture (Baek et al., 19 May 2026, Wang et al., 2 Sep 2025, DeVilling, 23 Oct 2025, Buehler, 14 Jan 2025).

References (4)

Generative Recursive Reasoning (2026)

In-situ graph reasoning and knowledge expansion using Graph-PReFLexOR (2025)

GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning (2025)

The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems (2025)
