A Survey on Latent Reasoning (2507.06203v1)

Published 8 Jul 2025 in cs.CL

Abstract: LLMs have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.

Summary

  • The paper quantifies the gap between explicit chain-of-thought and latent reasoning, revealing a ~2,700-fold increase in expressive capacity.
  • The paper presents a rigorous taxonomy of vertical and horizontal recurrence methods that enhance algorithmic reasoning and generalization in LLMs.
  • The paper demonstrates that innovative diffusion and optimization approaches enable efficient, infinite-depth reasoning for complex inference tasks.

Latent Reasoning in LLMs: A Comprehensive Survey

This survey provides a systematic and technically rigorous overview of latent reasoning in LLMs, focusing on the shift from explicit, token-level chain-of-thought (CoT) reasoning to multi-step inference performed entirely within the model’s continuous hidden state. The work synthesizes architectural, algorithmic, and interpretability advances, and situates latent reasoning as a central paradigm for overcoming the expressive and computational bottlenecks of language-based reasoning.

Motivation and Conceptual Framework

Explicit CoT, which requires models to verbalize intermediate reasoning steps in natural language, has been instrumental in improving both interpretability and accuracy in LLMs. However, the approach is fundamentally limited by the low information bandwidth of discrete tokens (roughly 15 bits per token) compared with high-dimensional hidden states (e.g., a 2560-dimensional FP16 state carries 2560 × 16 = 40,960 bits per step). The survey quantifies this gap, highlighting a ~2,700-fold difference in expressive capacity between explicit and latent reasoning.
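The back-of-envelope arithmetic behind the ~2,700× figure can be reproduced directly from the numbers the survey cites (15 bits per token; a 2560-dimensional FP16 hidden state):

```python
# Bandwidth comparison between a discrete token and a hidden state,
# using the figures quoted in the survey.
token_bits = 15            # ~log2 of a typical LLM vocabulary size
hidden_dims = 2560         # example hidden-state width
bits_per_dim = 16          # FP16

hidden_bits = hidden_dims * bits_per_dim
ratio = hidden_bits / token_bits

print(hidden_bits, round(ratio))  # prints: 40960 2731
```

The exact ratio (≈2,731) is what the survey rounds to "~2,700-fold"; the token-side estimate is the softer of the two numbers, since it depends on vocabulary size.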

Latent reasoning, by contrast, internalizes the reasoning process within the model’s hidden activations, enabling richer, non-linguistic inference trajectories and potentially more efficient computation. The survey formalizes latent reasoning as a set of spatial (layer-wise) and temporal (sequence-wise) transformations on hidden states, and provides a unified mathematical framework for both transformer-based and diffusion-based models.

Taxonomy of Latent Reasoning Approaches

The survey introduces a detailed taxonomy, dividing latent reasoning into two principal paradigms:

1. Vertical Recurrence (Activation-Based Methods)

These methods expand computational depth by iteratively refining activations within a fixed or dynamically determined set of layers. Key architectural and training strategies include:

  • Loop/Universal Transformer Recurrence: Models such as Universal Transformer, CoTFormer, Recursive Transformer, AlgoFormer, and Recurrent-Depth implement explicit or implicit layer-wise recurrence, often with dynamic stopping criteria. The field has converged on modular Pre/Loop/Coda architectures, simplifying depth embeddings and dynamic stopping mechanisms.
  • Activation with Explicit Hidden-State Feedback: Approaches like Coconut and CoTFormer feed hidden activations back into the input stream, enabling breadth-first exploration and adaptive depth without increasing parameter count.
  • Training-Induced Recurrence: Methods such as Coconut, CODI, CCOT, and System-1.5 Reasoning induce recurrent computation through curriculum learning, self-distillation, or strategic token insertion (e.g., pause, filler, or planning tokens), without architectural changes.
  • Applications: These methods have demonstrated strong generalization in algorithmic tasks, symbolic reasoning, and meta-learning, with evidence that recurrence—whether architectural or induced—enables extrapolation to more complex problems by increasing effective reasoning depth.
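The core mechanism shared by these vertical-recurrence methods can be sketched in a few lines: a single weight-tied block is applied repeatedly, so effective depth grows with the loop count rather than the parameter count. This is a minimal illustrative sketch (a toy residual MLP standing in for a transformer block), not the architecture of any specific paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden width
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

def shared_block(h):
    # One weight-tied "layer": a residual update with a nonlinearity.
    return h + np.tanh(h @ W1) @ W2

def latent_reason(h, n_loops):
    # Looping the same block trades extra compute (time) for extra
    # effective depth, with no new parameters.
    for _ in range(n_loops):
        h = shared_block(h)
    return h

h0 = rng.normal(size=(d,))
h_deep = latent_reason(h0, n_loops=12)  # 12 effective layers, one weight set
```

Real systems such as Universal Transformer or Recurrent-Depth add dynamic stopping criteria (halting when the state converges) rather than a fixed `n_loops`.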

2. Horizontal Recurrence (Hidden State-Based Methods)

These approaches focus on evolving a compressed hidden state over long sequences, trading temporal expansion for depth:

  • Linear-State Recurrence: Models such as Mamba-2, GLA, RWKV-6, and HGRN-2 maintain and update matrix-valued hidden states using associative operations, enabling efficient memory and context propagation.
  • Gradient-State Recurrence: Recent work (e.g., DeltaNet, TTT, Titans, Atlas) interprets hidden state updates as online optimization steps, with each token performing a gradient update on the state. Chunk-wise parallelization is employed to balance expressiveness and computational efficiency.
  • Training-Induced Hidden-State Conversion: Techniques like SUPRA, MOHAWK, Llamba, LoLCATs, and Liger distill pretrained transformers into recurrent or state-space models, achieving competitive performance with a fraction of the original training compute.
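The common skeleton of linear-state recurrence is a matrix-valued state updated by a gated outer product and queried linearly per token. The sketch below is illustrative (a constant scalar decay gate and toy dimensions are assumed); models like Mamba-2 or GLA use learned, input-dependent gates:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6  # toy head dimension and sequence length
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))

S = np.zeros((d, d))   # matrix-valued state: compressed memory of the prefix
alpha = 0.9            # decay gate (assumed constant here; learned in practice)
outputs = []
for t in range(T):
    S = alpha * S + np.outer(V[t], K[t])  # associative state update
    outputs.append(S @ Q[t])              # read-out: O(d^2) per token
outputs = np.array(outputs)  # shape (T, d); memory is constant in T
```

Because the update is associative, the loop can be parallelized chunk-wise at training time, which is the efficiency trick the gradient-state methods above build on.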

Mechanistic Interpretability and Theoretical Foundations

The survey rigorously addresses the mechanistic underpinnings of latent reasoning:

  • Layer Specialization: Empirical and interpretability studies reveal a clear division of labor across layers: shallow layers process local and factual information, intermediate layers form reasoning circuits and bridge entities, and deep layers perform output integration and decision-making. The depth of the network is shown to be a primary bottleneck for reasoning capacity, with the achievable CoT step length scaling linearly with layer count.
  • Information Flow: Attention mechanisms are identified as critical for propagating information across layers, and interventions on intermediate layer activations have a decisive impact on reasoning outcomes.
  • Turing Completeness: The survey synthesizes theoretical results demonstrating that both recurrent and transformer architectures are Turing complete under reasonable assumptions, and that CoT reasoning can endow fixed-depth models with universal computational capacity via prompt engineering.
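The intervention studies mentioned above typically use activation patching: run a model twice and splice a mid-layer activation from one run into the other to test whether that layer carries the decisive information. A toy, fully illustrative sketch (a stack of random tanh layers standing in for a transformer's residual stream):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 4, 5
Ws = [rng.normal(size=(d, d)) * 0.3 for _ in range(n_layers)]

def forward(x, patch_layer=None, patch_act=None):
    acts, h = [], x
    for k, W in enumerate(Ws):
        h = np.tanh(W @ h)
        if k == patch_layer:
            h = patch_act          # overwrite this layer's activation
        acts.append(h)
    return h, acts

x_clean, x_corrupt = rng.normal(size=d), rng.normal(size=d)
out_clean, clean_acts = forward(x_clean)
# Patch the clean layer-2 activation into the corrupted run:
out_patched, _ = forward(x_corrupt, patch_layer=2, patch_act=clean_acts[2])
# In this toy, the patch replaces the entire state, so later layers
# reproduce the clean output exactly; in a transformer, patching a
# single component measures how much of the outcome that layer decides.
```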

Infinite-Depth Reasoning and Diffusion Models

A significant portion of the survey is devoted to the emerging paradigm of infinite-depth reasoning, primarily realized through text diffusion models:

  • Masked Diffusion Models (MDMs): These models iteratively denoise masked sequences in parallel, enabling bidirectional context, global planning, and iterative self-correction. Innovations such as dKV-Cache and dLLM-Cache accelerate inference, while DoT-SEDD and MGDM extend the framework to chain-of-thought and multi-granularity reasoning.
  • Embedding-Based Diffusion Models: By operating in continuous embedding space, these models (e.g., Diffusion-LM, Plaid, DoT-Plaid) enable global refinement and controllable generation, with scaling laws closing the efficiency gap with autoregressive models.
  • Hybrid AR-Diffusion Models: Approaches like DiffuLLaMA, L2D, and Gemini Diffusion integrate autoregressive and diffusion paradigms, leveraging the strengths of both for complex reasoning and planning tasks.
  • Optimization-Based Perspective: The survey unifies these approaches under the principle that depth can be traded for time, with longer sequences or more diffusion steps yielding greater reasoning depth. Techniques such as infini-attention, test-time training (TTT), and implicit fixed-point RNNs exemplify this trade-off, enabling models to process million-token contexts with near-linear computational cost.
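The decoding pattern shared by the masked diffusion models above can be sketched as an unmask-in-parallel loop: all positions start masked and a few are committed per step. The "model" below is a stand-in that simply copies a target string (real MDMs predict tokens with a bidirectional transformer), so only the control flow is faithful:

```python
target = list("LATENT")
MASK = "_"

def denoise_step(seq, k):
    # Stand-in denoiser: propose the target token at every masked
    # position and commit the first k of them (real MDMs commit the
    # k most confident predictions).
    masked = [i for i, c in enumerate(seq) if c == MASK]
    for i in masked[:k]:
        seq[i] = target[i]
    return seq

seq = [MASK] * len(target)
steps = 0
while MASK in seq:
    seq = denoise_step(seq, k=2)  # unmask 2 positions per parallel step
    steps += 1
print("".join(seq), steps)  # prints: LATENT 3
```

The number of denoising steps, not the sequence length, governs latency, which is why caching schemes like dKV-Cache target the per-step cost.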

Numerical Results and Empirical Claims

The survey highlights several strong empirical results:

  • Bandwidth Gap: Latent reasoning operates at a ~2,700-fold higher information bandwidth than explicit CoT.
  • Performance Parity: Training-induced latent reasoning methods (e.g., CODI, CCOT) achieve parity with explicit CoT on benchmarks such as GSM8K.
  • Efficiency Gains: System-1.5 Reasoning delivers over 20× faster inference on GSM8K while preserving CoT accuracy, without architectural changes.
  • Diffusion Model Scaling: MDMs match or exceed the performance of larger autoregressive models on language understanding and mathematical reasoning tasks, with up to 10× faster sampling and improved robustness.

Implications and Future Directions

The survey’s synthesis has several important implications:

  • Expressive Power: Latent reasoning fundamentally expands the expressive and computational capacity of LLMs, enabling reasoning strategies unconstrained by language.
  • Architectural Flexibility: Both architectural and training-induced recurrence are viable, and hybrid approaches may yield further gains.
  • Interpretability: Mechanistic interpretability remains a challenge, as latent reasoning is less transparent than explicit CoT, but advances in probing and circuit analysis are beginning to reveal the internal structure of reasoning processes.
  • Scalability: Infinite-depth reasoning via diffusion and optimization-based methods offers a path to unbounded computation, with practical strategies for efficient training and inference.
  • Evaluation: The lack of standardized benchmarks and consistent training methodologies currently limits direct empirical comparison across models; unified evaluation frameworks are needed.

Speculation on Future Developments

Future research is likely to focus on:

  • Unified Reasoning Architectures: Integrating vertical and horizontal recurrence, diffusion, and optimization-based methods into cohesive frameworks.
  • Efficient Training and Distillation: Further reducing the compute required to endow models with latent reasoning capabilities, especially via distillation from large foundation models.
  • Interpretability and Control: Developing tools for interpreting and steering latent reasoning processes, including mechanisms for extracting or constraining internal reasoning traces.
  • Application Domains: Deploying latent reasoning models in domains requiring complex, multi-step inference, such as scientific discovery, formal verification, and autonomous agents.

In summary, this survey establishes latent reasoning as a foundational paradigm for next-generation AI systems, providing both a comprehensive technical taxonomy and a roadmap for future research in model cognition and reasoning.
