A Survey on Latent Reasoning (2507.06203v1)

Published 8 Jul 2025 in cs.CL

Abstract: LLMs have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.

Summary

  • The paper proposes a unified framework that formalizes latent reasoning methods to overcome the limitations of explicit chain-of-thought in LLMs.
  • It categorizes approaches into vertical and horizontal recurrence, highlighting innovations in architecture, algorithm design, and interpretability.
  • Empirical results demonstrate significant inference speedups and enhanced reasoning capacity, showcasing the potential for scalable and efficient AI systems.

Latent Reasoning in LLMs: A Comprehensive Survey

This survey provides a systematic and technically rigorous overview of latent reasoning in LLMs, focusing on the shift from explicit, token-level chain-of-thought (CoT) reasoning to multi-step inference performed entirely within the model’s continuous hidden state. The work synthesizes architectural, algorithmic, and interpretability advances, and situates latent reasoning as a critical direction for overcoming the expressive and computational bottlenecks of language-based reasoning.

Motivation and Conceptual Framework

Explicit CoT, which requires models to verbalize intermediate reasoning steps in natural language, has been instrumental in improving both interpretability and accuracy in LLMs. However, this approach is fundamentally limited by the low information bandwidth of discrete tokens (approximately 15 bits per token) compared to the high-dimensional hidden states (e.g., 2560-dimensional FP16 vectors, ~40,960 bits per state). The survey quantifies this gap and argues that latent reasoning—reasoning performed in the model’s internal, continuous representations—offers a path to richer, more efficient, and potentially non-linguistic forms of inference.
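The bandwidth gap cited above is simple to reproduce; the ~32K vocabulary size used here is an illustrative assumption (any vocabulary near that size gives roughly 15 bits per token):

```python
# Rough bandwidth comparison between a discrete token and a continuous
# hidden state, using the figures cited in the survey.
import math

vocab_size = 32_000                          # assumed vocabulary size
bits_per_token = math.log2(vocab_size)       # ~15 bits of information per token

hidden_dim = 2560                            # hidden-state width from the survey
bits_per_fp16 = 16
bits_per_state = hidden_dim * bits_per_fp16  # 40,960 bits per hidden state

ratio = bits_per_state / bits_per_token
print(f"token: ~{bits_per_token:.0f} bits, state: {bits_per_state} bits, "
      f"ratio ~{ratio:.0f}x")
```

On these figures, a single hidden state carries on the order of a few thousand times more raw information than a single emitted token.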

The authors introduce a unified mathematical framework for latent reasoning, formalizing both spatial (layer-wise) and temporal (sequence-wise) propagation of information. This framework encompasses a variety of architectures, including transformers with key-value caches, linear attention models, and recurrent neural networks, and provides a basis for categorizing latent reasoning methods.
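In the spirit of that framework, the two propagation axes can be contrasted schematically (the symbols below are illustrative, not the paper's exact notation):

```latex
% Spatial (vertical) recurrence: refine position t across depth l,
% attending over the layer's prefix states
h_t^{(l+1)} = f_\theta\bigl(h_t^{(l)},\, h_{\le t}^{(l)}\bigr)

% Temporal (horizontal) recurrence: evolve a compressed state over the sequence
S_t = g_\theta\bigl(S_{t-1},\, x_t\bigr), \qquad y_t = r_\theta\bigl(S_t\bigr)
```

Transformers with key-value caches instantiate the first form; linear attention models and RNNs instantiate the second.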

Taxonomy of Latent Reasoning Approaches

The survey organizes latent reasoning into two principal paradigms:

1. Vertical Recurrence (Activation-Based Methods)

These methods increase computational depth by iteratively refining activations within a fixed set of layers, either through explicit architectural recurrence or training-induced mechanisms.

  • Architectural Recurrence: Universal Transformer, CoTFormer, Recursive Transformer, AlgoFormer, and Recurrent-Depth exemplify models that implement looped or recurrent computation over layers. The field has converged on modular Pre/Loop/Coda architectures, with dynamic or fixed stopping criteria and a trend toward eliminating explicit depth embeddings.
  • Activation with Explicit Hidden-State Feedback: Models such as Coconut and CoTFormer feed hidden states back into the input stream, enabling breadth-first exploration and adaptive depth without increasing parameter count.
  • Training-Induced Recurrence: Techniques like Coconut, CODI, CCOT, and methods using filler, pause, or planning tokens demonstrate that recurrent computation can be induced in standard transformers via curriculum learning, self-distillation, or token manipulation, without architectural changes.
  • Applications: These methods have demonstrated strong performance in algorithmic generalization, symbolic reasoning, and meta-learning, with the ability to extrapolate to harder problem instances by increasing recurrence steps at inference.
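The Pre/Loop/Coda pattern behind these architecturally recurrent models can be sketched in a few lines. This is a toy illustration, not any specific paper's architecture: the single-matrix "block", the weight scales, and the fixed-point stopping rule are all assumptions made for the sketch.

```python
# Minimal sketch of the Pre/Loop/Coda pattern used by architecturally
# recurrent models (Universal Transformer, Recurrent-Depth, etc.):
# a prelude embeds the input once, a weight-shared loop block is applied
# repeatedly, and a coda maps the final state to the output.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_pre = rng.normal(scale=0.3, size=(d, d))   # prelude weights (applied once)
W_loop = rng.normal(scale=0.1, size=(d, d))  # shared weights, reused every step
W_coda = rng.normal(scale=0.3, size=(d, d))  # coda weights (applied once)

def loop_block(h):
    # one recurrent refinement step with shared parameters
    return np.tanh(h @ W_loop)

def forward(x, max_steps=32, tol=1e-4):
    h = np.tanh(x @ W_pre)
    for step in range(max_steps):
        h_next = loop_block(h)
        if np.linalg.norm(h_next - h) < tol:  # dynamic stopping criterion
            h = h_next
            break
        h = h_next
    return h @ W_coda, step + 1

x = rng.normal(size=(1, d))
y, steps = forward(x)
print("steps used:", steps, "output shape:", y.shape)
```

Raising `max_steps` at inference is what lets such models spend more compute on harder instances without adding parameters.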

2. Horizontal Recurrence (Hidden State-Based Methods)

These approaches expand the temporal dimension, evolving a compressed hidden state over long sequences.

  • Linear-State Recurrence: Models such as Mamba-2, GLA, RWKV-6, and HGRN-2 maintain and update a matrix-valued hidden state, enabling efficient, recurrent-style updates.
  • Gradient-State Recurrence: Methods like DeltaNet, TTT, Titans, and Atlas treat the hidden state as a set of fast-adapting parameters, updated via online optimization (e.g., SGD, Adam-like, or second-order methods). Chunk-wise parallelization is employed to balance expressiveness and computational efficiency.
  • Training-Induced Hidden-State Conversion: Techniques such as SUPRA, MOHAWK, Llamba, LoLCATs, and Liger demonstrate that transformers can be converted into recurrent or state-space models via distillation or low-rank adaptation, achieving competitive performance with significantly reduced training cost.
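The matrix-valued hidden state shared by these linear-state models can be illustrated with a toy decayed outer-product update; the shapes and the scalar forget gate here are illustrative assumptions, not any one model's parameterization.

```python
# Toy sketch of a matrix-valued recurrent state in the style of
# linear-state models (Mamba-2, GLA, RWKV-6): decay the old state,
# write a key-value outer product, read out with the query.
import numpy as np

rng = np.random.default_rng(1)
d_k, d_v, T = 4, 4, 6
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
Q = rng.normal(size=(T, d_k))
decay = 0.9                       # assumed scalar forget gate

S = np.zeros((d_k, d_v))          # matrix-valued hidden state
outputs = []
for t in range(T):
    S = decay * S + np.outer(K[t], V[t])  # constant-size state update
    outputs.append(Q[t] @ S)              # query-based read-out

Y = np.stack(outputs)
print(Y.shape)
```

Because the state stays a fixed `d_k x d_v` matrix regardless of sequence length, the per-token cost is constant, which is the efficiency advantage these models trade against the exact recall of full attention.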

Mechanistic Interpretability and Theoretical Foundations

The survey provides a detailed analysis of the mechanistic underpinnings of latent reasoning:

  • Layer Specialization: Empirical and interpretability studies reveal that transformer layers specialize for different reasoning functions: shallow layers process local and syntactic information, intermediate layers form reasoning circuits and store knowledge, and deep layers perform output refinement and decision-making. The depth of the network is shown to be a primary bottleneck for reasoning capacity, with the achievable CoT step length scaling linearly with layer count.
  • Information Flow: Attention mechanisms are critical for propagating information across layers, and specialized circuits emerge to support multi-step reasoning.
  • Turing Completeness: Theoretical results establish that transformers are Turing complete under certain conditions, and that chain-of-thought generation allows them to simulate arbitrary computation, whether elicited through prompting or built in through architectural modifications.

Infinite-Depth Reasoning and Diffusion Models

A significant portion of the survey is devoted to the emerging paradigm of infinite-depth reasoning, particularly as realized by text diffusion models:

  • Masked Diffusion Models: These models iteratively denoise masked or noisy drafts of the entire output sequence, enabling bidirectional context, global planning, and iterative self-correction. Innovations such as dKV-Cache and dLLM-Cache accelerate inference, while models like DoT-SEDD and MGDM extend diffusion to chain-of-thought and multi-granularity reasoning.
  • Embedding-Based Diffusion Models: Operating in continuous embedding space, these models (e.g., Diffusion-LM, Plaid, DoT-Plaid) enable global refinement and controllable generation.
  • Hybrid AR-Diffusion Models: Approaches like DiffuLLaMA, L2D, and Gemini Diffusion combine the strengths of autoregressive and diffusion paradigms, supporting both sequential coherence and bidirectional reasoning.
  • Optimization-Based Perspective: The survey highlights that time (sequence length) can be traded for depth (number of reasoning steps), with models like Infini-attention, TTT, Titans, and implicit fixed-point RNNs enabling unbounded reasoning depth via online optimization or fixed-point iteration.
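The masked-denoising loop at the heart of these diffusion models can be mimicked with a toy unmasking schedule. The random confidence scores below are a stand-in for a real denoiser's predictions, and the "clean" target sequence is hypothetical:

```python
# Toy illustration of iterative masked denoising: each pass, the model
# commits the tokens it is most confident about, refining the whole
# sequence in parallel rather than strictly left-to-right.
import numpy as np

rng = np.random.default_rng(2)
MASK = -1
target = np.array([5, 3, 9, 1, 7, 2])   # hypothetical "clean" sequence
seq = np.full_like(target, MASK)        # start fully masked

steps = 0
while (seq == MASK).any():
    steps += 1
    masked = np.flatnonzero(seq == MASK)
    conf = rng.random(masked.size)      # stand-in model confidence
    # unmask the top half (at least one) of the remaining masked positions
    k = max(1, masked.size // 2)
    chosen = masked[np.argsort(conf)[-k:]]
    seq[chosen] = target[chosen]        # commit the "denoised" tokens

print(seq.tolist(), "in", steps, "passes")
```

Because every pass conditions on the full partially-revealed sequence, earlier commitments can inform later ones in both directions, which is the bidirectional-context property the survey emphasizes.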

Numerical Results and Claims

The survey references strong empirical results, such as:

  • Training-induced latent reasoning methods (e.g., CODI) achieving parity with explicit CoT on GSM8K.
  • System-1.5 Reasoning delivering over 20× faster inference on GSM8K while preserving CoT accuracy.
  • Diffusion models (e.g., MGDM) achieving state-of-the-art results on complex planning tasks, and dKV-Cache/dLLM-Cache providing up to 9.1× inference speedup.
  • Titans-S (250M) matching a 1.3B transformer on 1-shot recall after 1M optimization steps, demonstrating the effectiveness of “deeper through time.”

Implications and Future Directions

The survey’s synthesis has several important implications:

  • Practical Deployment: Latent reasoning methods enable more efficient, expressive, and potentially interpretable reasoning in LLMs, with architectural and training-induced approaches offering complementary trade-offs in terms of resource requirements, scalability, and compatibility with existing models.
  • Theoretical Insights: The unification of spatial and temporal recurrence, and the optimization-based view of reasoning, provide a principled foundation for designing future models with unbounded reasoning capacity.
  • Evaluation Challenges: The lack of standardized benchmarks and consistent training methodologies currently limits direct empirical comparison across models. The field would benefit from unified evaluation frameworks.
  • Research Directions: Enhancing the effectiveness of deep layers, developing more efficient infinite-depth architectures, and further integrating diffusion and autoregressive paradigms are identified as promising avenues for advancing AI cognition.

Conclusion

This survey establishes latent reasoning as a foundational paradigm for next-generation LLMs, systematically mapping the technical landscape and identifying key challenges and opportunities. By shifting reasoning into the model’s latent space, researchers can transcend the expressive and computational limits of language-based CoT, paving the way for more powerful, flexible, and efficient AI systems. The integration of mechanistic interpretability, architectural innovation, and optimization-based reasoning will be central to future progress in this domain.
