
Markovian Thinking in LLM Reasoning

Updated 9 October 2025
  • Markovian thinking is a paradigm that uses fixed-size states to condition each reasoning step, ensuring scalability and efficiency.
  • The framework, implemented in Delethink, segments reasoning into fixed-length chunks with state compression, decoupling compute and memory from total reasoning length.
  • Empirical results show that models using Markovian thinking achieve competitive accuracy and superior resource efficiency compared to traditional chain-of-thought methods.

Markovian thinking refers to a paradigm in which reasoning processes are structured such that each step conditions only on a constant-size, bounded state rather than all prior history. In the context of reinforcement learning (RL) for reasoning in LLMs, Markovian thinking contrasts sharply with the conventional approach in chain-of-thought (CoT) RL, where the model’s state grows unbounded as the reasoning chain (e.g., the prompt plus all reasoning tokens) extends. The Markovian thinking framework introduces a state that encapsulates just the essential information for continuation, enabling efficient, scalable reasoning with compute and memory costs that remain constant per step—completely decoupled from total reasoning length. This structure facilitates the training of LLMs to reason for very long chains, with empirical results demonstrating that the approach can match or exceed accuracy, scale more flexibly, and deliver dramatic efficiency gains compared to standard methods (Aghajohari et al., 8 Oct 2025).

1. Structural Definition of Markovian Thinking

Markovian thinking, as formally defined in “The Markovian Thinker” (Aghajohari et al., 8 Oct 2025), is a paradigm in which a reasoning LLM advances its chain of thought by conditioning on a fixed-size state at each step. The core attributes are:

  • Fixed Context Window: Reasoning is divided into fixed-size segments (chunks), with the model’s attention and memory limited to the current chunk and a concise carryover of information (“Markovian state”) from the previous one.
  • State Compression: At each chunk boundary, the model must output a textual state of fixed length m that suffices for seamless continuation after the environment resets the prompt.
  • Markovian Property: The decision at chunk l+1 depends only on the current query and the carryover from chunk l, not on the full historical trace.

This is in contrast to conventional RL environments for CoT, where the environment state comprises the entire history, causing quadratic growth in compute and memory.
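The defining property can be illustrated with a toy sketch (not from the paper): two traces that differ early but share the same fixed-size carryover yield an identical conditioning state, so the next step cannot depend on the discarded history. The function name and the carryover size m=3 are hypothetical.

```python
# Illustrative check of the Markovian property: the conditioning state for the
# next chunk is built only from the query and the last-m carryover tokens.

def conditioning_state(query, trace, m=3):
    """State seen by the policy at the next chunk: query plus fixed carryover."""
    return tuple(query) + tuple(trace[-m:])

query = ["<q>"]
trace_a = ["a", "b", "c", "x", "y", "z"]   # two traces that differ early...
trace_b = ["p", "q", "x", "y", "z"]        # ...but share the final m tokens

# The full histories differ, yet the conditioning states coincide.
assert conditioning_state(query, trace_a) == conditioning_state(query, trace_b)
```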

2. Practical Implementation: The Delethink Environment

Delethink embodies Markovian thinking by systematically structuring reasoning as a series of fixed-length “chunks”. The implementation details are:

  • Chunking Reasoning: The model is trained to generate up to a context limit C (e.g., 8K tokens) for each reasoning segment.
  • Context Reset with Carryover: At each chunk boundary, instead of appending new tokens to an ever-growing prompt, the RL environment resets the prompt to the original query concatenated with the last m tokens of the preceding chunk. The core transition is:

x_{l+1} = query ⊕ ℓ_l[−m:]

where ℓ_l[−m:] denotes the final m tokens of the previous chunk.

  • RL Training Algorithm: The agent is rewarded according to standard PPO-style RL objectives, summing per-chunk contributions and normalizing by the number of generated tokens. Chunk rollouts terminate upon generation of an end-of-sequence (EOS) token or a fixed maximum number of chunks.
  • Constant-Size Compute and Memory: By construction, Delethink only requires the transformer’s KV cache for a single chunk, eliminating context growth and avoiding the unbounded memory footprint and throughput degradation of an ever-lengthening prompt.
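The context-reset transition can be sketched as follows, treating token sequences as Python lists; the function name and the values of C and m are illustrative, not taken from the Delethink implementation.

```python
# Minimal sketch of the Delethink context reset: the next prompt is the
# original query plus the last m tokens of the previous chunk, so its size
# is fixed regardless of how long the model has been reasoning.

def delethink_reset(query, prev_chunk, m):
    """Build x_{l+1} = query ⊕ (last m tokens of chunk l)."""
    return list(query) + list(prev_chunk[-m:])

query = ["<q1>", "<q2>"]
chunk = [f"tok{i}" for i in range(8000)]        # a full 8K-token reasoning chunk
next_prompt = delethink_reset(query, chunk, m=512)

assert len(next_prompt) == 2 + 512              # prompt size is fixed, not cumulative
assert next_prompt[:2] == query                 # the original query is always retained
```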

3. Computational Benefits and Scaling Properties

The Markovian thinking paradigm provides substantial improvements in efficiency and scalability:

| Paradigm | Compute Scaling | Memory Usage | Throughput with Length |
|---|---|---|---|
| LongCoT-RL | Quadratic in total tokens | Grows with n | Degrades as O(n²) |
| Delethink (MT) | Linear in total tokens (fixed chunk size) | Constant (per chunk) | Near-constant |
  • Compute: For a trace of S chunks of n tokens each (total length nS), LongCoT-RL requires O(n²S²) attention FLOPs, while Delethink/MT requires O(n²S)—linear rather than quadratic in the number of chunks.
  • Memory: Only a fixed-size cache is needed in Delethink, independent of reasoning horizon.
  • Throughput: Unlike conventional approaches where inference time increases quadratically with sequence length, Delethink maintains near-constant throughput, as generation proceeds chunk-wise.

The paper reports that, at 96K average thinking length, LongCoT-RL training consumed an estimated 27 H100-months, compared to 7 H100-months for Delethink.
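The scaling gap can be made concrete with a back-of-the-envelope attention-cost model, assuming generation cost scales with the square of the attended context (constants and non-attention FLOPs ignored; the function names are ours, not the paper's):

```python
# Toy attention-FLOP model: cost ~ (attended context length)^2.

def longcot_flops(S, C):
    """Full-history CoT: context grows to S*C tokens -> quadratic in total length."""
    return (S * C) ** 2

def delethink_flops(S, C):
    """Delethink: each of S chunks attends to only ~C tokens -> linear in S."""
    return S * C ** 2

S, C = 12, 8_000   # e.g. ~96K thinking tokens as 12 chunks of 8K

# The saving is an S-fold reduction at matched total length.
assert longcot_flops(S, C) == S * delethink_flops(S, C)
```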

4. Empirical Results: Comparative Performance and Scaling

In experimental evaluations, Delethink’s instantiation of Markovian thinking demonstrates performance and scaling advantages:

  • Accuracy: A 1.5B R1-Distill model trained under Delethink (8K chunk size, 24K total tokens) matched or outperformed LongCoT-RL on math reasoning benchmarks.
  • Scaling Beyond Training Window: Delethink-trained models continue to improve at test time with extended reasoning length (up to 96K tokens), while LongCoT-RL models plateau once reasoning length exceeds the RL training budget.
  • Resource Advantage: Constant KV-cache usage and linear compute scaling enable cost-effective training and inference, making long-chain reasoning practical on contemporary hardware.

These results establish that restricting context to a fixed-size Markovian state does not degrade—and may even enhance—reasoning performance on complex tasks.

5. Theoretical and Practical Significance

The Markovian thinking design has multiple theoretical and practical implications:

  • Compact Sufficient State: The paradigm compels the model to store all task-relevant information in a minimal, textual carryover that must suffice for subsequent reasoning steps, which also offers a degree of interpretability.
  • Compatibility with Sparse/Linear Attention: Since only a fixed-size window is ever attended, Markovian thinking is inherently compatible with memory- and compute-efficient attention mechanisms.
  • Environment Redesign as a Lever: The results highlight that environment redesign—specifically, constraining the information flow in the RL process—enables efficient scaling and practical long-horizon reasoning.
  • Transferable to Non-Language Domains: While demonstrated in LLM reasoning, the essential pattern—dividing long sequential tasks into locally Markovian segments—may apply to any RL problem where context or trajectory size poses computational bottlenecks.

6. Algorithmic Summary

The core Markovian thinking loop in Delethink is as follows:

  1. Initialize: Set x_1 = query.
  2. For l = 1, 2, ..., L (until EOS or cap reached):
    • Generate up to C tokens (reasoning chunk) from x_l.
    • If no EOS: set x_{l+1} = query ⊕ ℓ_l[−m:].
    • Clear and re-prime the KV cache.
    • Continue.

The policy is trained via RL with chunk-local and global reward, ensuring optimal information is preserved in each carryover.
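The loop above can be sketched end to end, with a toy random generator standing in for the LLM; the function names and the values of the chunk cap C, carryover size m, and chunk limit L are all illustrative:

```python
import random

def generate(prompt, max_tokens):
    """Toy stand-in for LLM decoding: emits tokens, occasionally an EOS."""
    out = []
    for _ in range(max_tokens):
        tok = random.choice(["tok"] * 20 + ["<eos>"])
        out.append(tok)
        if tok == "<eos>":
            break
    return out

def markovian_think(query, C=64, m=8, L=16):
    """Run the Markovian thinking loop: chunked generation with fixed carryover."""
    trace = []
    prompt = list(query)                      # x_1 = query
    for _ in range(L):                        # at most L chunks
        chunk = generate(prompt, C)           # up to C tokens per chunk
        trace.extend(chunk)
        if chunk and chunk[-1] == "<eos>":    # terminate on EOS
            break
        prompt = list(query) + chunk[-m:]     # x_{l+1} = query ⊕ last m tokens
        # (a real implementation would clear and re-prime the KV cache here)
    return trace

random.seed(0)
trace = markovian_think(["<q>"])
assert trace  # produces a nonempty reasoning trace
```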

7. Broader Impact and Future Research Directions

Markovian thinking shifts the focus of scalable reasoning model development from architectural sophistication to environment constraints. By demonstrating that chunked, Markovian states suffice for long-chain reasoning without accuracy loss or prohibitive resource costs, the approach opens avenues for:

  • Extending models to multimodal tasks or domains with intrinsic long-term dependencies.
  • Jointly optimizing model architectures and RL environment designs for ultra-long sequence applications.
  • Exploring alternative state compression schemes or summary mechanisms at chunk boundaries.

Future work may focus on adaptive carryover mechanisms, hybrid Markov-non-Markov strategies, or integration with retrieval or external memory systems for even longer-horizon inference.


In summary, Markovian thinking reframes long-context reasoning in LLMs as a process of sequentially generating fixed-size, sufficient states, enabling high-performance, scalable, and efficient RL-based reasoning, as demonstrated in Delethink (Aghajohari et al., 8 Oct 2025). This framework holds substantial promise for the next generation of scalable reasoning systems.
