Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

Published 26 May 2026 in cs.LG and cs.AI | (2605.26733v1)

Abstract: Looped LLMs (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that convergence toward stable fixed points while preserving effectiveness represents a promising way. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.


Authors (7)

Summary

The paper introduces STARS, a training method that applies Jacobian Spectral Radius Regularization to ensure stable recurrent dynamics in LoopLMs.
It highlights an effectiveness-stability trade-off: internal normalization boosts initial performance but causes instability, whereas external normalization keeps the dynamics bounded at the cost of reasoning depth.
Empirical results on algorithmic arithmetic and GSM8K demonstrate that STARS significantly reduces performance degradation and improves peak accuracy compared to standard fine-tuning and prior baselines.

Stabilizing Recurrent Dynamical Systems for Scalable Latent Reasoning in Looped LLMs

Motivation and Problem Formulation

Test-time compute scaling is central to enhancing LLM reasoning, but most prior approaches—such as chain-of-thought prompting and candidate search—are bottlenecked by the discrete, sequential nature of token-level reasoning. Looped LLMs (LoopLMs) instead emphasize iterative refinement in latent space using parameter sharing. Ideally, increasing the recurrence depth should monotonically improve or retain task performance, as the latent representations are refined with each step.

However, empirical studies demonstrate that existing LoopLMs suffer from unreliable test-time scaling: performance often peaks at a modest number of recurrent steps and then degrades or collapses entirely as recurrence depth increases. Standard supervised fine-tuning causes overfitting to the training iteration count, hindering LoopLMs' ability to generalize reasoning to deeper recurrences. The need to ensure both reasoning effectiveness and dynamical stability in LoopLMs, particularly under recurrent scaling, motivates a deeper investigation of the latent dynamics governing these architectures.

Latent Reasoning as Discrete Dynamical Systems

The paper rigorously frames LoopLMs as discrete-time dynamical systems in latent space, where the hidden state evolution under recurrence is characterized by attractors and fixed points. In well-behaved settings, reasoning should converge to useful fixed points—stable internal representations that capture the completion of computation. The study distinguishes desirable attractors (informative, stable) from those arising in poorly designed models (degenerate, unstable, or chaotic).
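The framing above can be written compactly as follows. The symbols are assumptions chosen for illustration and are not necessarily the paper's notation: $h_t$ is the latent state after $t$ loops, $x$ is the input, and $f_\theta$ is the shared recurrent block.

$$
h_{t+1} = f_\theta(h_t, x), \qquad h^\ast = f_\theta(h^\ast, x)
$$

Here $h^\ast$ is a fixed point. In the standard definition, it is asymptotically stable if trajectories that start sufficiently close to it remain close and satisfy $\lim_{t \to \infty} \lVert h_t - h^\ast \rVert = 0$.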

Through experimental diagnostics on controlled algorithmic tasks (e.g., 4-digit addition), the analysis identifies a previously overlooked trade-off: internal normalization schemes (e.g., Pre-Norm) facilitate effective information propagation and initially strong performance but induce state norm divergence and instability under extended recurrence. In contrast, external normalization (e.g., Post-Norm) ensures bounded latent trajectories and stability, but at the cost of underpowered reasoning—typically converging to shallow fixed points that fail at complex tasks.
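The norm behavior described above can be illustrated with a deliberately small toy, sketched here under strong assumptions: a 2-dimensional state, a fixed hypothetical weight matrix `W` in place of a Transformer block, and plain L2 normalization in place of LayerNorm or RMSNorm. It is not the paper's experimental setup and only shows why the placement of the normalization matters for boundedness.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def l2_normalize(v):
    n = norm(v)
    return [x / n for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Hypothetical block weights (an assumption, not taken from the paper).
W = [[0.6, -0.3], [0.3, 0.6]]

def pre_norm_step(h):
    # Internal normalization: only the branch input is normalized,
    # so the residual stream itself is never rescaled.
    update = matvec(W, l2_normalize(h))
    return [a + b for a, b in zip(h, update)]

def post_norm_step(h):
    # External normalization: the whole block output is normalized,
    # so the state stays on the unit sphere.
    update = matvec(W, h)
    return l2_normalize([a + b for a, b in zip(h, update)])

def norm_trajectory(step, h0, n_steps):
    # Record the state norm after each recurrent step.
    h, norms = list(h0), []
    for _ in range(n_steps):
        h = step(h)
        norms.append(norm(h))
    return norms

pre_norms = norm_trajectory(pre_norm_step, [1.0, 0.0], 32)
post_norms = norm_trajectory(post_norm_step, [1.0, 0.0], 32)
```

In this toy the pre-norm state norm increases at every step, while the post-norm state norm stays at 1. The toy says nothing about task accuracy, which is the other half of the trade-off.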

Architectural additions (Prelude and Coda layers) and L2 regularization are shown empirically to be insufficient: they neither resolve the effectiveness-stability trade-off nor robustly stabilize latent dynamics under deep recurrence.

Training for Stability and Effectiveness: The STARS Framework

The proposed solution, STARS (STAbility-driven Recurrent Scaling), draws on dynamical systems theory and enforces asymptotic stability via Jacobian Spectral Radius Regularization (JSRR). By directly regularizing the spectral radius $\rho(J)$ of the recurrent block's Jacobian $J$ (following Lyapunov stability theory), STARS pushes hidden-state trajectories toward locally contractive behavior: when $\rho(J) < 1$, small perturbations around a fixed point decay and the system converges to that fixed point.
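The role of the condition $\rho(J) < 1$ follows from the standard linearization argument, sketched here with assumed notation: $f_\theta$ is the recurrent block, $h^\ast$ a fixed point, $e_t = h_t - h^\ast$ the deviation after $t$ loops, and $J = \partial f_\theta / \partial h$ evaluated at $h^\ast$.

$$
e_{t+1} = f_\theta(h^\ast + e_t) - f_\theta(h^\ast) = J e_t + O(\lVert e_t \rVert^2)
\quad\Longrightarrow\quad
e_t \approx J^t e_0
$$

$$
\lim_{t \to \infty} \lVert J^t \rVert^{1/t} = \rho(J)
$$

By the second identity (Gelfand's formula), $\lVert J^t \rVert$ shrinks geometrically when $\rho(J) < 1$, so small deviations decay. When $\rho(J) > 1$, some deviations grow. This is a local statement about the neighborhood of $h^\ast$, not a global guarantee.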

Because the latent dimensionality $D$ is large, forming the $D \times D$ Jacobian and computing its eigenvalues explicitly is impractical. STARS instead estimates the spectral radius with single-step power iteration and Jacobian-vector products. The estimator provides a batch-level training signal at low computational cost and avoids the gradient instabilities associated with multi-step iteration.
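A minimal sketch of this kind of estimator is given below. It is not the paper's implementation. Several things are assumptions made for illustration: the toy block `tanh(W h)`, the hypothetical symmetric weights `W` (chosen so that the dominant eigenvalue is real and the estimate has a known target), the evaluation point `h = 0`, and the finite-difference Jacobian-vector product. The probe vector `v` is carried over between calls, so each call performs one power-iteration step.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical symmetric weights (assumption): top eigenvalue is real.
W = [[0.5, 0.2], [0.2, 0.3]]

def block(h):
    # Toy recurrent block f(h) = tanh(W h); its Jacobian at h = 0 is W.
    return [math.tanh(z) for z in matvec(W, h)]

def jvp(f, h, v, eps=1e-5):
    # Central finite-difference approximation of J(h) @ v.
    # The Jacobian matrix itself is never formed.
    plus = f([a + eps * b for a, b in zip(h, v)])
    minus = f([a - eps * b for a, b in zip(h, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def power_step(f, h, v):
    # One power-iteration step: the norm of J v is the current estimate,
    # and the normalized J v becomes the probe vector for the next call.
    jv = jvp(f, h, v)
    estimate = norm(jv)
    return estimate, [x / estimate for x in jv]

h = [0.0, 0.0]   # point at which the Jacobian is probed
v = [1.0, 0.0]   # persistent unit probe vector
estimates = []
for _ in range(40):  # stands in for 40 consecutive training steps
    rho_hat, v = power_step(block, h, v)
    estimates.append(rho_hat)
```

In an autodiff framework, the finite difference would typically be replaced by a forward-mode Jacobian-vector product such as `jax.jvp` or `torch.func.jvp`. For a general non-symmetric Jacobian with a complex dominant eigenvalue pair, this norm-based estimate can oscillate instead of converging, so it should be read as a proxy.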

Random loop sampling is combined with JSRR: the number of recurrent steps for each batch is drawn from a distribution, so the model sees a range of loop depths during optimization. The loss is a weighted expectation over the sampled depths, which jointly encourages stability and effectiveness along the latent trajectory rather than at a single training depth.
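The combination can be sketched as follows. The summary does not specify the depth distribution, the form of the penalty, or the weighting, so everything concrete here is a hypothetical choice made for illustration: uniform sampling over `[min_steps, max_steps]`, a hinge penalty on the estimated spectral radius `rho_hat`, and the `weight` and `target` values.

```python
import random

def sample_loop_depth(rng, min_steps=1, max_steps=8):
    # Draw the number of recurrent steps for one batch.
    # Uniform sampling is an assumption, not the paper's stated choice.
    return rng.randint(min_steps, max_steps)

def regularized_loss(task_loss, rho_hat, weight=0.1, target=1.0):
    # Task loss plus a hinge penalty that is active only when the
    # estimated spectral radius exceeds the target (hypothetical form).
    return task_loss + weight * max(0.0, rho_hat - target)

rng = random.Random(0)
depths = [sample_loop_depth(rng) for _ in range(1000)]
```

In a training loop, each batch would run the recurrent block for the sampled depth, compute the task loss at that depth, and add the penalty from a spectral radius estimate. Averaging over batches approximates the expectation over depths.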

Empirical Evaluation and Main Numerical Results

Evaluations cover two settings: (1) algorithmic arithmetic tasks with Transformers trained from scratch, used to characterize latent dynamics precisely, and (2) complex, open-domain mathematical reasoning with fine-tuned LoopLMs (specifically, Ouro-1.4B), benchmarked on datasets including GSM8K, MATH500, ASDiv, SVAMP, and AMC23.

Key Results

Arithmetic Tasks: The addition task shows that STARS-trained models exhibit invariant (100%) accuracy irrespective of recurrent step count, and hidden states reliably converge to stable attractors in latent space.
Mathematical Reasoning: On GSM8K, Ouro-1.4B suffers a 20.47% accuracy drop after 8 recurrent steps, while Ouro-1.4B-STARS degrades by only 8.26%. Peak accuracy also improves by roughly 4 percentage points (from 70.46% to 74.18%), indicating that stability regularization does not impede expressivity.
Baseline Comparison: Compared to SFT and established LoopLM baselines (e.g., Huginn, Recurrent-Llama-3.2), Ouro-1.4B-STARS achieves both higher peak and substantially more robust out-of-training-horizon performance.
Ablations: Both random loop sampling and JSRR contribute to improved scaling, but their combination outperforms either strategy in isolation—validating the necessity of joint regularization and exposure to loop-depth variation.

Theoretical and Practical Implications

This work formalizes the trade-off between effectiveness and stability in recurrent architectures for LLMs and asserts that test-time scalable reasoning is only attainable by securing both. STARS establishes a mathematically sound training route for LoopLMs, linking convergence properties of dynamical systems (via spectral radius control) to reliable, scalable inference.

In practice, these findings suggest that future LoopLMs for reasoning tasks—particularly those with open-ended compute budgets—must explicitly regulate hidden-state dynamics for stable attractor convergence. The approach generalizes to other architectures where recurrent computation is leveraged for reasoning, including continual and equilibrium models.

Future Directions

Future research avenues include generalizing STARS beyond mathematical domains, exploring adaptive regularization schedules for spectral radius constraints, and integrating model-based detection of "effective" attractors. Another direction would be leveraging similar stability guarantees in planning or commonsense reasoning, where depth-scaling could unlock qualitatively new behaviors. Extending the analysis to heterogeneous, deeper, or multi-block recurrence designs also remains open.

Conclusion

The study systematically diagnoses the instability of contemporary LoopLMs under test-time scaling and rigorously addresses it via spectral radius regularization and loop-depth augmentation. STARS provides a theoretically principled and empirically validated pathway for building looped LMs capable of stable, scalable latent reasoning across recurrent depths, with clear implications for the design and training of next-generation reasoning architectures (2605.26733).
