Generative Recursive Reasoning Models (GRAM)
- GRAM is a family of probabilistic, multi-trajectory neural architectures that use recursive latent-state refinement for both conditional reasoning and unconditional generation.
- It employs a hierarchical process with stochastic high-level transitions and deterministic low-level steps to outperform baseline models on tasks like Sudoku and graph coloring.
- Key design principles include loop detection, mandatory grounding, and recursive self-training for reward modeling, addressing failures like epistemic stasis.
Generative Recursive reAsoning Models (GRAM) are a family of probabilistic, multi-trajectory neural architectures for extended computation and structured generation. The GRAM framework introduces stochastic, recursive latent-state refinement as a generative process, supporting both conditional reasoning and unconditional data generation. Recent advances in this domain demonstrate that recursive generative reasoning models outperform deterministic and purely autoregressive baselines on complex tasks that require multiple, alternative solution trajectories, integrated uncertainty, and explicit epistemic rationales (Baek et al., 19 May 2026, Wang et al., 2 Sep 2025, DeVilling, 23 Oct 2025).
1. Formal Definition and Recursive Process
GRAM implements reasoning as a latent-variable generative process over a recursive trajectory $z_{0:T} = (z_0, z_1, \dots, z_T)$, with $T$ the total number of latent transition steps. The process is factorized as:
- Initial prior: $p(z_0)$, typically a fixed learned constant or a sample from a fixed base distribution.
- Recursive transitions: For $t = 1, \dots, T$, $z_t \sim p_\theta(z_t \mid z_{t-1}, x)$, where $x$ enters through a (potentially empty) encoding $e_x$ of the input.
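A hierarchical structure is adopted, decomposing each latent state $z_t$ into high-level ($z_t^H$) and low-level ($z_t^L$) states, updated via a sequence of deterministic low-level steps and a stochastic high-level transition:
- Low-level: $z_t^{L,k} = f_L(z_t^{L,k-1}, z_{t-1}^H, e_x)$ for $k = 1, \dots, K$, starting from $z_t^{L,0} = z_{t-1}^L$; set $z_t^L = z_t^{L,K}$. Here $f_L$ is the low-level update, $K$ the number of low-level steps per transition, and $e_x$ the input encoding.
- High-level: $z_t^H \sim p_\theta(z_t^H \mid z_{t-1}^H, z_t^L)$, a learned stochastic transition conditioned on the previous high-level state and the refined low-level state.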
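For conditional reasoning, the model outputs $y$ by marginalizing over $z_{0:T}$: $p_\theta(y \mid x) = \int p_\theta(y \mid z_T)\, p_\theta(z_{0:T} \mid x)\, dz_{0:T}$, using a decoder $p_\theta(y \mid z_T)$. For unconditional generation, $x$ is empty and $z_T$ is decoded directly (Baek et al., 19 May 2026).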
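The recursion described in this section can be illustrated with a toy rollout. This is a minimal sketch under stated assumptions, not the architecture from Baek et al.: the update rules `low_level_step` and `high_level_transition`, their coefficients, and the fixed-scale Gaussian noise are all hypothetical stand-ins for learned networks. In particular, the source reports that learned stochastic guidance is needed and that naive noise injection fails, so the noise term here only marks where the stochastic component would sit.

```python
import math
import random


def low_level_step(low, high, x_enc):
    """Deterministic low-level update (hypothetical toy rule, not a learned network)."""
    return [math.tanh(0.5 * l + 0.3 * h + 0.2 * e) for l, h, e in zip(low, high, x_enc)]


def high_level_transition(high, low, rng, noise_scale=0.1):
    """Stochastic high-level update: deterministic mean plus Gaussian noise.

    Assumption: fixed-scale noise is a placeholder for a learned stochastic component.
    """
    mean = [math.tanh(0.6 * h + 0.4 * l) for h, l in zip(high, low)]
    return [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean]


def run_trajectory(x_enc, num_outer_steps=4, num_inner_steps=3, seed=0):
    """Roll out one latent trajectory and return the list of (high, low) states."""
    rng = random.Random(seed)
    dim = len(x_enc)
    high = [0.0] * dim  # initial state taken as a fixed constant
    low = [0.0] * dim
    trajectory = [(high, low)]
    for _ in range(num_outer_steps):
        # Several deterministic low-level refinements per outer step.
        for _ in range(num_inner_steps):
            low = low_level_step(low, high, x_enc)
        # One stochastic high-level transition per outer step.
        high = high_level_transition(high, low, rng)
        trajectory.append((high, low))
    return trajectory


if __name__ == "__main__":
    x_enc = [1.0, -0.5, 0.25]
    first = run_trajectory(x_enc, seed=1)
    second = run_trajectory(x_enc, seed=2)
    # Different seeds give different trajectories for the same input.
    print(len(first), first[-1][0] != second[-1][0])
```

Because only the high-level transition is random, running the rollout with several seeds yields multiple alternative trajectories for the same input, which is the property the multi-trajectory framing relies on.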
2. Inference and Training
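Inference uses amortized variational techniques. The variational posterior,
$$q_\phi(z_{1:T} \mid z_0, x, y) = \prod_{t=1}^{T} q_\phi(z_t \mid z_{t-1}, x, y),$$
shares structure with the prior but conditions on the target $y$. The evidence lower bound (ELBO) optimized during training is: $$\log p_\theta(y \mid x) \ge \mathbb{E}_{q_\phi}\big[\log p_\theta(y \mid z_T)\big] - \sum_{t=1}^{T} \mathbb{E}_{q_\phi}\Big[\mathrm{KL}\big(q_\phi(z_t \mid z_{t-1}, x, y) \,\|\, p_\theta(z_t \mid z_{t-1}, x)\big)\Big],$$ with the KL divergence decomposed across time. Deep supervision is applied every fixed number of steps, and gradients are truncated across supervision boundaries for computational efficiency.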
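To make the time-decomposed KL term concrete, the following sketch computes a single-sample ELBO estimate as a reconstruction log-likelihood minus a sum of per-step KL divergences. It assumes diagonal Gaussian posterior and prior transitions, which the source does not specify; the function names `gaussian_kl` and `elbo_estimate` are illustrative.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def elbo_estimate(log_likelihood, posterior_steps, prior_steps):
    """Reconstruction term minus the KL summed over latent transition steps.

    Each element of posterior_steps / prior_steps is a (mean, std) pair of lists
    describing one transition (assumed Gaussian for this sketch).
    """
    kl_sum = sum(
        gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
        for (mu_q, sigma_q), (mu_p, sigma_p) in zip(posterior_steps, prior_steps)
    )
    return log_likelihood - kl_sum


if __name__ == "__main__":
    posterior = [([1.0], [1.0]), ([0.0], [1.0])]
    prior = [([0.0], [1.0]), ([0.0], [1.0])]
    # Only the first step contributes KL, since the second step matches the prior.
    print(elbo_estimate(-1.0, posterior, prior))
```

Writing the KL as a per-step sum is what allows each supervision segment to be penalized separately when gradients are truncated at supervision boundaries.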
Additional architectural components include:
- Adaptive computation time (ACT): a temporal-difference loss for learned halting.
- Latent Process Reward Model (LPRM): a value head over latent states that predicts task-specific rewards and is trained with an MSE loss.
Inference-time scaling is achieved by increasing recursion depth ($T$), using ACT halting, or sampling multiple stochastic trajectories in parallel, with outputs selected by majority vote or LPRM value (Baek et al., 19 May 2026).
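The two selection rules mentioned above, majority vote and value-based selection, can be sketched as follows. The helper names are hypothetical, candidates are assumed to be hashable (for example strings or tuples), and the `values` list stands in for scores that a value head such as the LPRM would produce.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent candidate; ties go to the earliest one in the list."""
    counts = Counter(candidates)
    best_count = max(counts.values())
    for candidate in candidates:
        if counts[candidate] == best_count:
            return candidate


def select_by_value(candidates, values):
    """Return the candidate whose predicted value is highest."""
    best_index = max(range(len(candidates)), key=lambda i: values[i])
    return candidates[best_index]


if __name__ == "__main__":
    # Decoded outputs from several sampled trajectories (toy data).
    samples = ["1234", "1243", "1234", "2134"]
    scores = [0.70, 0.95, 0.65, 0.10]  # assumed value-head outputs
    print(majority_vote(samples))
    print(select_by_value(samples, scores))
```

The two rules can disagree, as in the toy data: voting favors the answer that recurs across trajectories, while value-based selection trusts the scorer even for an answer sampled once.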
3. Experimental Results and Empirical Performance
GRAM demonstrates state-of-the-art results on structured reasoning and constraint satisfaction tasks:
- Sudoku-Extreme: GRAM achieves 97.0% accuracy, vs. 87.4% for TRM and 55.0% for HRM.
- ARC-AGI-1: 52.0% (GRAM) vs. 44.6% (TRM).
- N-Queens (8×8): GRAM 99.7% (single-sample), 90.3% coverage (20 samples), outperforming AR Transformer, MDLM, and TRM.
- Graph Coloring: GRAM resolves 8-node graphs with 2.7 conflicts, outperforming AR’s 19.0 conflicts; 51.3% coverage on 10-node graphs.
In unconditional generation:
- Sudoku puzzle generation: GRAM (10.9 M params, 16 steps) produces 99.05% valid, near-unique puzzles—significantly exceeding D3PM-large at 91.33%.
- Binarized MNIST: GRAM achieves an Inception Score of 1.99 (vs. VAE’s 1.70) and FID of 74.30.
Distinctive findings include monotonic improvement in reasoning/generation quality with increasing recursion depth and parallel sample count, and ablation showing that learned stochastic guidance is necessary (naive noise injection fails) (Baek et al., 19 May 2026).
4. Connections to Recursive Generative Reasoning and the Mirror Loop
Recursive generative reasoning defines a discrete dynamical system over state-sequence trajectories: $s_{t+1} = F(s_t, e_t)$, with $e_t$ representing external evidence. The ungrounded mirror loop corresponds to $e_t = \emptyset$, where reflection occurs solely via internal reformulation. Experimental analysis reveals a rapid decline of mean informational change (normalized edit distance $\Delta I$) during recursive steps, converging to an “attractor of epistemic stasis” in which informational change stays low. This regime is marked by:
- Decline in metrics such as $\Delta I$, embedding drift, n-gram novelty, and character-level entropy.
- Empirical saturation across varied LLMs, with a reduction in $\Delta I$ reported for GPT-4o-mini, Claude 3 Haiku, and Gemini 2.0 Flash.
- Minimal grounding alone (e.g., an external verification at step 3) produces a rebound in $\Delta I$ and restores sustained variance (DeVilling, 23 Oct 2025).
This suggests that unguided recursion induces a regime where generative reasoning becomes performative rather than epistemically informative, with progress reliant on deliberate injection of external evidence.
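The informational-change metric discussed in this section is a normalized edit distance between successive outputs. A minimal sketch follows, assuming character-level Levenshtein distance divided by the longer string's length; the source does not state the exact normalization, so that choice is an assumption.

```python
def levenshtein(a, b):
    """Character-level edit distance using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def informational_change(outputs):
    """Per-step change between each output and the one before it."""
    return [normalized_edit_distance(prev, cur) for prev, cur in zip(outputs, outputs[1:])]


if __name__ == "__main__":
    # Toy sequence of successive reflections that stops changing.
    steps = ["abcd", "abcf", "abcf"]
    print(informational_change(steps))
```

A sequence whose per-step values shrink and stay small is the pattern the section describes as epistemic stasis.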
5. Design Principles, Failure Modes, and Interventions
Design recommendations for effective GRAM systems, derived from empirical and theoretical analysis, include:
- Loop Detection: Monitor metrics such as embedding drift and n-gram novelty over a rolling window, flagging when both drop below calibrated thresholds.
- Mandatory Grounding: Incorporate external verification (retrieval, code execution, oracle check) at least once every fixed number of iterations, with the interval chosen empirically.
- State Forking: Upon loop detection, spawn multiple continuations using varied sampling or prompt augmentation, selecting the most divergent.
- Meta-Loss Penalties: Apply explicit penalties in training (or RLHF) to consecutive hidden states with high cosine similarity, discouraging fixed-point convergence.
- Dissipative Inference Architectures: Structure the reasoning pipeline so every reflective pass is coupled with external informational influx, ensuring ongoing epistemic acquisition rather than recursive closure.
These principles are essential to convert syntactic recursion into genuine epistemic reflection and prevent the collapse into the mirror loop state (DeVilling, 23 Oct 2025).
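Loop detection and mandatory grounding can be combined in a small controller, sketched below under simplifying assumptions. It tracks only word-level n-gram novelty over a rolling window, whereas the recommendation above also monitors embedding drift, which would need an embedding model. The class `LoopDetector`, the function `should_ground`, and every threshold and interval value are hypothetical.

```python
from collections import deque


def ngrams(text, n=3):
    """Set of word-level n-grams in the text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_novelty(previous, current, n=3):
    """Fraction of n-grams in `current` that do not occur in `previous`."""
    current_grams = ngrams(current, n)
    if not current_grams:
        return 0.0
    return len(current_grams - ngrams(previous, n)) / len(current_grams)


class LoopDetector:
    """Flags a loop when mean novelty over a full rolling window is below a threshold."""

    def __init__(self, window=3, threshold=0.1, n=3):
        self.threshold = threshold
        self.n = n
        self.scores = deque(maxlen=window)
        self.last = None

    def update(self, text):
        """Record a new output and return True when a loop is flagged."""
        if self.last is not None:
            self.scores.append(ngram_novelty(self.last, text, self.n))
        self.last = text
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.threshold


def should_ground(step, interval, loop_flagged):
    """Ground on a fixed schedule, or immediately when a loop is flagged."""
    return loop_flagged or (step > 0 and step % interval == 0)


if __name__ == "__main__":
    detector = LoopDetector(window=2, threshold=0.1)
    outputs = ["the plan is fine", "the plan is fine", "the plan is fine"]
    for step, text in enumerate(outputs):
        flagged = detector.update(text)
        print(step, flagged, should_ground(step, interval=3, loop_flagged=flagged))
```

When `should_ground` returns true, the caller would run an external check (retrieval, code execution, or an oracle) or fork the state before the next reflective pass.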
6. GRAM as Foundation and Reward Models
The GRAM paradigm extends to reward modeling, as illustrated by GRAM-R$^2$, a generative foundation reward model. GRAM-R$^2$ produces both preference labels and explicit, natural-language reward rationales. The training regime leverages a recursive self-training loop:
- A “preference-proving” module synthesizes rationales for labeled pairs.
- GRAM-R$^2$ is then iteratively self-trained on unlabeled data by generating pseudo-labels and rationales, filtering low-confidence or generic outputs using a Bayesian-justified scoring mechanism.
Empirical results on RM-Bench, JudgeBench, and coding/math response ranking show that GRAM-R$^2$ achieves 85.7% accuracy (LLaMA-3.1-8B backbone), outperforming discriminative (76.0%) and generative no-rationale (79.2%) baselines. Downstream, GRAM-R$^2$ accelerates RLHF adaptation and achieves higher win rates in human feedback–driven tuning (Wang et al., 2 Sep 2025).
A plausible implication is that by folding explicit reasoning and rationale generation directly into the reward modeling process, GRAMs can serve as generalist reward models requiring significantly less human-labeled preference data.
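One round of the pseudo-labeling step in the self-training loop described in this section can be sketched as a filter over model outputs. This is an illustration under loud assumptions: the source's Bayesian-justified scoring mechanism is not specified here, so a plain confidence threshold and a minimum rationale length stand in for it, and `judge` is a hypothetical callable returning a label, a rationale, and a confidence.

```python
def self_training_round(unlabeled_pairs, judge, min_confidence=0.8, min_rationale_words=5):
    """Generate pseudo-labels and keep only confident, non-trivial ones.

    `judge(pair)` is assumed to return (label, rationale, confidence).
    """
    kept = []
    for pair in unlabeled_pairs:
        label, rationale, confidence = judge(pair)
        if confidence < min_confidence:
            continue  # drop low-confidence pseudo-labels
        if len(rationale.split()) < min_rationale_words:
            continue  # crude stand-in for filtering generic rationales
        kept.append({"pair": pair, "label": label, "rationale": rationale})
    return kept


def toy_judge(pair):
    """Hypothetical judge: prefers the longer response, more confident for larger gaps."""
    response_a, response_b = pair
    gap = abs(len(response_a) - len(response_b))
    label = "A" if len(response_a) >= len(response_b) else "B"
    rationale = f"Response {label} is more detailed and covers more of the request."
    confidence = min(1.0, 0.5 + 0.1 * gap)
    return label, rationale, confidence


if __name__ == "__main__":
    pairs = [("short", "a much longer answer"), ("same", "size")]
    for example in self_training_round(pairs, toy_judge):
        print(example["label"], example["rationale"])
```

The kept examples would be added to the training set for the next round, and repeating the generate, filter, and retrain cycle is what makes the procedure recursive.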
7. Limitations and Directions for Future Research
Current GRAM training requires deep supervision with truncated BPTT, which is less efficient (both compute and memory) than standard feedforward Transformer architectures. Scaling to very large backbones and to open-domain or multi-turn reasoning remains challenging, with open questions on variational posterior calibration and reward model selection. The recursive self-training procedure for reward models, while empirically effective, lacks formal convergence analysis.
Potential future avenues include:
- Efficient scalable approximations for training and inference.
- Formal analysis of the properties and convergence behavior of recursive self-training procedures.
- Integration of richer reasoning structures, e.g., multi-step RL objectives and contrastive filtering.
- Extending GRAMs to domains such as multi-turn dialogue, real-world planning, and unsupervised structured data generation (Baek et al., 19 May 2026, Wang et al., 2 Sep 2025, DeVilling, 23 Oct 2025).