The Markovian Thinker (2510.06557v1)

Published 8 Oct 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

Summary

  • The paper demonstrates that segmenting reasoning into fixed-size chunks with a Markovian state enables linear scaling in compute and constant memory usage.
  • It employs Delethink, a specialized RL environment that resets context at chunk boundaries, ensuring efficient long-horizon reasoning irrespective of token count.
  • Empirical tests on math benchmarks show that Delethink outperforms traditional LongCoT methods, enabling scaling to reasoning traces of over 96K tokens.

The Markovian Thinker: Efficient Long-Horizon Reasoning via Markovian RL Environments

Introduction and Motivation

The paper introduces the Markovian Thinking paradigm for training LLMs to perform long-horizon reasoning with linear compute and constant memory, addressing the quadratic scaling bottleneck inherent in standard chain-of-thought (CoT) reinforcement learning (RL) environments. In the conventional LongCoT RL setup, the model's state is the concatenation of the prompt and all previously generated reasoning tokens, resulting in unbounded context growth and quadratic attention cost as the reasoning trace lengthens. This severely limits the practical scaling of reasoning length, both during RL training and inference.

The Markovian Thinking paradigm proposes a fundamental shift: the RL environment is redefined so that the policy (the LLM) always conditions on a constant-size state, regardless of the total reasoning length. This decouples the length of the reasoning trace from the context size, enabling efficient scaling to much longer chains of thought.

Delethink: Markovian RL Environment

The paper instantiates Markovian Thinking with Delethink, a novel RL environment that structures reasoning into a sequence of fixed-size chunks. Within each chunk, the model generates as usual, but at each chunk boundary, the environment resets the context to a fresh prompt containing the original query and a short carryover (the "Markovian state") from the previous chunk. The policy must learn, via RL, to encode all necessary information for seamless continuation into this bounded textual state.

This approach enforces a hard cap on the model's context size at every step, making the compute and memory requirements independent of the total reasoning length. The transition dynamics of the environment are modified such that, at chunk boundaries, only the carryover state is preserved, and all previous tokens are deleted from the context.
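
The loop structure can be sketched compactly. Below is a minimal, illustrative sketch of a Delethink-style rollout, not the paper's released code; the `generate` callable, chunk size, carryover length (measured in characters here for simplicity, where the paper uses tokens), and stop marker are assumptions for illustration.

```python
# Minimal sketch of a Delethink-style rollout (illustrative, not the paper's code).
def delethink_rollout(generate, query, chunk_size=8192, carryover=512,
                      max_chunks=12, stop_marker="</answer>"):
    carry = ""           # bounded textual state carried across context resets
    trace = []           # full trace kept only for logging; never fed back whole
    for _ in range(max_chunks):
        prompt = query + "\n" + carry          # constant-size context: query + carryover
        chunk = generate(prompt, max_new_tokens=chunk_size)
        trace.append(chunk)
        if stop_marker in chunk:               # model signalled it has finished
            break
        carry = chunk[-carryover:]             # reset: keep only the tail as the new state
    return "".join(trace)
```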

Figure 1: Delethink redefines the RL environment as a chunked, Markovian process, with context resets at chunk boundaries and only a short carryover state, in contrast to LongCoT's ever-growing context.

Theoretical and Empirical Analysis of Compute Scaling

The authors provide a detailed analysis of the computational complexity of Delethink versus LongCoT. In LongCoT, both training and inference scale quadratically with the number of thinking tokens due to the growing context. In Delethink, the cost is quadratic in the chunk size but only linear in the number of chunks, yielding overall linear scaling with reasoning length.
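
A rough back-of-the-envelope comparison makes this concrete. The sketch below is illustrative only: it counts pairwise-attention work, ignores constant factors, the carryover, and non-attention FLOPs, and assumes an 8K chunk size.

```python
# Back-of-the-envelope attention-cost comparison (sketch only).
def attention_cost(n_tokens, chunk_size=None):
    if chunk_size is None:                 # LongCoT: context grows with n -> ~n^2 work
        return n_tokens ** 2
    n_chunks = -(-n_tokens // chunk_size)  # ceil division
    return n_chunks * chunk_size ** 2      # Delethink: ~(n/C) * C^2 = n*C work

for n in (24_000, 96_000):
    ratio = attention_cost(n) / attention_cost(n, chunk_size=8_000)
    print(f"{n} thinking tokens: LongCoT/Delethink attention-cost ratio ~ {ratio:.1f}x")
```

As expected for quadratic versus linear scaling, the ratio grows with the thinking length.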

Empirical results confirm these predictions: for a 1.5B parameter model, training to an average thinking length of 96K tokens requires 27 H100-months with LongCoT-RL, but only 7 H100-months with Delethink. The per-GPU throughput remains constant as the thinking budget increases in Delethink, while it degrades linearly in LongCoT due to memory growth.

Figure 2: Delethink matches and surpasses LongCoT-RL in accuracy during RL training while using less compute; only Delethink continues to improve as the thinking budget is scaled beyond training limits.

Figure 3: Computational profiles show Delethink's linear scaling in FLOPs and constant memory, compared to LongCoT's quadratic scaling.

Empirical Results: Reasoning Performance and Test-Time Scaling

Delethink is evaluated on competition-level math benchmarks (AIME'24, AIME'25, HMMT'25), as well as out-of-distribution tasks (GPQA-Diamond, LiveCodeBench). With a 24K thinking-token budget, Delethink matches or outperforms LongCoT-RL 24K, and significantly outperforms LongCoT-RL 8K, confirming the necessity of extended reasoning.

A key result is that Delethink enables genuine test-time scaling: when allowed to reason with more tokens at inference than seen during training, Delethink continues to improve, while LongCoT-RL plateaus at its training limit. For some problems, solutions only emerge after reasoning with 100K+ tokens, despite the model being trained with a 24K budget.

Figure 4: On IID math tasks, Delethink outperforms LongCoT-RL 24K; on OOD tasks, Delethink matches or slightly beats LongCoT-RL 24K. Delethink's throughput remains constant as thinking scales.

Figure 5: Only Delethink continues to improve as the thinking budget is scaled at inference, while LongCoT-RL plateaus at its training limit.

Scaling to Extreme Reasoning Lengths

The linear compute scaling of Delethink enables training with much larger thinking budgets. The authors demonstrate Delethink with a 96K token budget, reaching mean trace lengths up to 42K and surpassing both the baseline and test-time-extended Delethink 24K. Only 150 RL steps are needed, highlighting the practical feasibility of scaling reasoning length by an order of magnitude.

Figure 6: Delethink 96K achieves higher accuracy and longer average trace lengths than Delethink 24K, demonstrating scalability to extreme reasoning lengths.

Markovian Tracing in Off-the-Shelf LLMs

A critical finding is that off-the-shelf reasoning LLMs (R1-Distill 1.5B–14B, GPT-OSS 120B, Qwen3-30B-A3B) already exhibit strong Markovian behavior zero-shot: when Delethink Tracing is applied at initialization (without any additional training or prompting), these models recover most of their LongCoT performance, and in some cases even surpass it. This provides a strong initialization for RL and suggests that Markovian Thinking is a natural fit for current SOTA models.

Figure 7: Delethink Tracing at initialization recovers most of LongCoT performance, indicating Markovian behavior in the base model.

Figure 8: SOTA LLMs (GPT-OSS-120B, Qwen3-30B-A3B) are capable of Markovian Thinking zero-shot, providing a strong initialization for Delethink training.

Ablations: Context and State Size

The paper includes extensive ablations on the per-chunk context size ($\mathcal{C}$) and the Markovian state size ($m$). Delethink is robust to reductions in context size down to 4K, with only modest performance degradation. However, at 2K, models struggle to terminate within budget and accuracy drops, indicating a lower bound on effective context size. The Markovian state size can be reduced significantly (e.g., $m \ll \mathcal{C}$) with little impact on accuracy for R1-Distill models, but larger models (Qwen3) benefit from larger state sizes, especially on long-trace tasks.

Figure 9: Ablation of the Markovian state size $m$ at fixed per-chunk context $\mathcal{C}=8\text{K}$; R1-Distill models are robust to small $m$, while Qwen3 benefits from larger $m$.

Figure 10: Scaling behavior of R1-Distill models under varying per-chunk contexts; 4K and 8K are effective, while 2K severely limits performance.

Limitations and Stress Tests

Delethink's Markovian approach is less effective on tasks that require persistent, large-scale memory, such as crossword puzzles where previously found words must be retained. In these stress tests, Delethink remains competitive but its zero-shot limits are evident, as the bounded state cannot always encode all necessary information.

Figure 11: On CrossWordBench, Delethink remains competitive but its zero-shot limits are evident due to the need for persistent memory.

Implementation and Practical Considerations

Delethink is implemented as a simple wrapper around standard transformer inference, requiring only that the context be reset at chunk boundaries and a short carryover state be appended. The approach is architecture-agnostic and can be combined with any attention variant (full, sliding, streaming). The RL objective is a straightforward adaptation of PPO to the chunked environment, with per-chunk loss normalized by the total number of thinking tokens.

A sample implementation using HuggingFace Transformers is provided, demonstrating the ease of integration into existing LLM stacks.
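
For concreteness, here is a hedged sketch of what such a wrapper might look like with HuggingFace Transformers. It is not the paper's released implementation; the checkpoint name, chunk size, carryover length, sampling settings, and termination check are illustrative assumptions.

```python
# Illustrative Delethink-style inference wrapper (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed example checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def delethink_generate(query, chunk_tokens=8192, carry_tokens=512, max_chunks=4):
    carry_ids = []                                    # bounded textual state, as token ids
    answer_chunks = []
    for _ in range(max_chunks):
        # Context is always the original query plus the short carryover.
        prompt_ids = tok(query, return_tensors="pt").input_ids[0].tolist() + carry_ids
        input_ids = torch.tensor([prompt_ids])
        out = model.generate(input_ids, max_new_tokens=chunk_tokens,
                             do_sample=True, temperature=0.6)
        new_ids = out[0][len(prompt_ids):].tolist()   # tokens generated in this chunk
        answer_chunks.append(tok.decode(new_ids, skip_special_tokens=True))
        if tok.eos_token_id in new_ids:               # model terminated its reasoning
            break
        carry_ids = new_ids[-carry_tokens:]           # context reset: keep only the carryover
    return "".join(answer_chunks)
```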

Implications and Future Directions

The Markovian Thinking paradigm, as instantiated by Delethink, demonstrates that long-horizon reasoning can be achieved with linear compute and constant memory, without architectural modifications. This opens the door to training and deploying reasoning LLMs that can think for hundreds of thousands or millions of tokens, previously infeasible due to quadratic scaling.

The results suggest that non-quadratic sequence architectures (e.g., linear attention, state-space models) may be particularly well-suited for reasoning tasks, as the Markovian environment decouples reasoning length from context size. The strong zero-shot Markovian behavior in current SOTA LLMs indicates that further scaling and optimization in this direction is both practical and promising.

Conclusion

The Markovian Thinker framework redefines the RL environment for reasoning LLMs, enabling efficient, scalable long-horizon reasoning by enforcing a constant-size state via chunked generation and explicit state passing. Delethink matches or surpasses standard LongCoT-RL in accuracy, achieves substantial compute savings, and uniquely enables test-time scaling far beyond training limits. The paradigm is robust, simple to implement, and compatible with existing LLM architectures. These results have significant implications for the future of scalable reasoning in LLMs, suggesting a path toward models capable of sustained, efficient, and arbitrarily long chains of thought.

Explain it Like I'm 14

1. What is this paper about?

This paper is about teaching AI LLMs to “think” for a long time without slowing down too much. Today, many models solve hard problems by writing long chains of thoughts before giving an answer. But the usual way of doing this gets very slow and expensive as the thoughts get longer. The authors introduce a new way, called Markovian Thinking, that lets the model think for a very long time while keeping the cost much lower.

2. What questions are they asking?

  • Can we redesign the “thinking setup” so that a model can think longer without getting much slower?
  • Can a model learn to carry its plan forward using only a short reminder, instead of re-reading everything it wrote before?
  • Will this new setup keep accuracy high (or even improve it) while using less computer power?
  • Do current models already show signs of this behavior naturally?

3. How did they do it? (Methods explained simply)

Think of the model’s reasoning like writing in a very long notebook:

  • Old way (LongCoT): The model keeps everything it has written on one never-ending page. Each new sentence has to “look back” at the entire page. The longer the page, the slower things get—slowing down faster than just “more lines,” because checking all pairs of lines takes a lot of work.
  • New way (Markovian Thinking with “Delethink”): The model writes in pages (chunks) of the same size. At the end of each page, it writes a short summary—like a sticky note with the key points. Then it starts a fresh page using only the original question plus that short sticky note. It does not carry over the full previous page, just the summary.

Key idea: The model learns (using reinforcement learning, a training method where the model gets rewards for good outcomes) to write the right kind of short summary at each page end so it can pick up smoothly on the next page. Because it only needs to read a fixed-size page plus a short summary, the amount of work grows in a straight line with the number of pages instead of exploding.

Simple analogy:

  • LongCoT = one giant scroll you reread constantly.
  • Delethink = a notebook with pages; you only need the question and a sticky-note summary to continue.

“Markovian” here means the model only needs what’s in the current state (the question + short carryover), not the entire past, to keep reasoning correctly.

4. What did they find, and why is it important?

Main results (in plain terms):

  • Same or better accuracy with less compute: A model trained with Delethink (using 8K-token pages) can “think” up to 24K tokens and match or beat the usual long-thinking method that was trained to handle all 24K at once. That’s strong performance with less cost.
  • Keeps improving when allowed to think longer: When you let the Delethink-trained model think for even more tokens at test time, it keeps getting better, while the old method stops improving (plateaus).
  • Big speed and cost savings: For very long thinking (around 96,000 tokens), the old way was estimated to cost about 27 “H100-months” (one H100-month is roughly a month of work on a single very powerful GPU). Delethink needed about 7 H100-months for the same job. That’s roughly a 4× reduction in cost.
  • Many models already show this behavior: Even before training with Delethink, several existing reasoning models sometimes naturally produce “Markovian” traces (they can continue well with short summaries). That makes training with Delethink easier and more effective.

Why it matters:

  • Lower cost and faster training/inference means we can build smarter systems without needing huge budgets.
  • Being able to think longer helps with hard tasks like math, coding, and puzzles that need many steps.

5. What’s the impact?

This work shows that changing the “thinking environment” can be just as important as changing the model itself. By decoupling “how long the model thinks” from “how much it has to re-read,” models can, in principle, think for millions of tokens without the usual slowdown. This could:

  • Enable much deeper reasoning on tough problems.
  • Reduce the energy and money required to train and use big models.
  • Work well with future architectures built for efficiency.
  • Make long, step-by-step problem solving practical for real-world uses (education, research, programming, science).

In short, the paper’s idea—think in fixed-size chunks with smart summaries—lets AI think longer, better, and cheaper.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of concrete gaps and open questions that remain unresolved and can guide future research:

  • Task generality beyond math-centric evaluation:
    • How well Markovian Thinking transfers to non-math domains (e.g., long-form reasoning in law, scientific QA, open-domain multi-hop retrieval, dialog planning) remains unclear.
    • Evidence for GPT-OSS 120B is anecdotal in Sec. “why works”; systematic, broad benchmarks (code, planning, commonsense, long documents) are not reported.
  • Limits of the Markovian assumption:
    • Which tasks can/cannot be compressed into a bounded textual state without loss? Is there a principled characterization (e.g., lower bounds on necessary state size m for different problem classes)?
    • Failure modes when key early information cannot be faithfully reconstructed from “last m tokens.”
  • Carryover design and sufficiency:
    • The choice of “last m tokens” as carryover is ad hoc; no analysis of optimal m, sensitivity, or adaptive strategies exists.
    • No comparison to learned summaries, key-value compression, or structured state representations as alternatives to “last m tokens.”
  • Chunking policy and boundaries:
    • Chunk size C and fixed boundaries are hand-chosen; it is unknown whether dynamic chunking, learned reset points, or semantic-aligned boundaries reduce error propagation.
    • Effects of placing boundaries mid-derivation/sentence are unquantified.
  • Error accumulation across chunks:
    • No diagnostics quantify how inconsistencies in the textual state compound with chunk count S, nor how often re-summarization self-corrects vs. drifts.
    • Lack of per-boundary error analyses (e.g., accuracy drop conditional on number of resets).
  • Reward design and credit assignment:
    • With a terminal, trajectory-level reward, how does credit propagate across resets, especially when early chunk states cause later failures?
    • Sensitivity to advantage estimators, variance reduction strategies, and the effect of per-chunk auxiliary rewards is unstudied.
  • Comparison fairness and confounders:
    • It’s unclear if Delethink’s gains stem from architectural/algorithmic advantages or simply from being able to train with larger effective batch/longer horizons due to shorter contexts.
    • Details on matching KL schedules, temperatures, sampling schemes, and hyperparameters between LongCoT-RL and Delethink are insufficient for airtight causal attribution.
  • Theoretical guarantees:
    • No formal result shows that any LongCoT-computable solution has a bounded-state Markovian equivalent for reasonable m, nor bounds on performance loss vs. m and C.
    • Absence of sample-complexity or convergence analysis tailored to chunked, resetting environments.
  • Test-time scaling limits:
    • While Delethink improves beyond the trained thinking budget, the point at which returns saturate or degrade (and why) is unknown.
    • Robustness of very long traces (e.g., 100K–1M tokens) under distribution shift from training remains untested.
  • Interplay with architecture choices:
    • Claims that linear-time architectures (e.g., Mamba, linear attention) could particularly benefit are unverified; no experiments evaluate Delethink with non-quadratic sequence models.
    • How recurrence/state-space models should interface with textual state resets is unexplored.
  • Multi-turn, tool-use, and retrieval settings:
    • How to retain tool outputs, retrieval evidence, or program states across resets is unspecified.
    • Whether the textual state can faithfully “carry” external-tool context without repeated recomputation or loss is an open question.
  • Multimodal applicability:
    • The approach is text-only; how to compress image/audio/video features into a bounded “state” across resets is unaddressed.
  • Safety, privacy, and CoT exposure:
    • Forcing models to externalize internal state may increase chain-of-thought leakage; no mechanisms to control or redact sensitive state are proposed.
    • Impacts on refusal behaviors and safety alignment under Markovian Thinking are unstudied.
  • Latency and systems overhead:
    • Frequent resets re-send the query and state; wall-clock latency, token I/O, and backend overheads vs. LongCoT are not rigorously benchmarked.
    • Interaction with KV-cache reuse and streaming inference is unclear (potentially lower cache benefits due to resets).
  • Interpretability of learned state:
    • The content and structure of the learned “state near end-of-chunk” is not analyzed (e.g., are they concise summaries, scratchpads, or opaque markers?).
    • Methods for constraining or auditing the state (format, faithfulness, verifiability) are absent.
  • Robustness and calibration:
    • No robustness tests under adversarial prompts, noisy inputs, or distractors evaluate whether the state maintains fidelity through resets.
    • Effects on calibration and uncertainty estimates as S grows are unknown.
  • Early stopping and termination behavior:
    • How the model learns reliable stopping criteria under periodic resets (avoiding redundant or looping states) is not studied.
    • Potential interactions with early-exit or pruning strategies in a Markovian environment are not explored.
  • Data efficiency and stability:
    • Whether Delethink requires fewer/more RL steps or exhibits different instability modes than LongCoT-RL is not quantified beyond coarse compute estimates.
    • Ablations on group size G, iterations cap I, KL strength, and reference policy drift are missing.
  • Generalization across scales:
    • RL training is demonstrated at 1.5B; it is unknown whether the same training dynamics, stability, and gains hold at 7B–70B+ under realistic compute constraints.
    • The observed zero-shot Markovian behavior at 120B is not accompanied by RL training at that scale.
  • Alternative state-carryover mechanisms:
    • No exploration of hybrid schemes (e.g., compact learned memory tokens, differentiable external memory, or lossy KV compression) vs. textual carryover.
    • Potential benefits of explicitly supervised state summaries (distilled from high-quality proofs) are untested.
  • Evaluation breadth and metrics:
    • Evaluations emphasize accuracy; there is little on reasoning quality measures (faithfulness, step validity), human-judged coherence across boundaries, or error typology specific to resets.
    • No standardized benchmarks tailored to chunked reasoning are introduced to stress-test state sufficiency.
  • Reproducibility of compute claims:
    • Cost estimates rely on a particular stack (VERL + SGLang) and hardware; sensitivity to implementation, kernel choices, scheduler policies, and parallelism strategy is not provided.
    • Hidden constants (e.g., prompt reconstruction, logging, rollout orchestration) may affect real-world linear-vs-quadratic break-even points.
  • Interaction with alignment objectives:
    • Effects of KL-to-reference and instruction-following regularization on the learned state style (verbosity, redundancy, safety) are unclear.
    • Potential trade-offs between compact state and alignment constraints are not examined.

Practical Applications

Immediate Applications

Below are actionable applications that can be deployed now, leveraging the paper’s Markovian Thinking paradigm and the Delethink environment (code and weights provided in the paper).

  • Cost-reduced RL fine-tuning for reasoning LLMs (software/AI platforms)
    • Train math/coding reasoners with linear compute and constant memory by replacing LongCoT environments with Delethink in RLVR pipelines.
    • Tools/Workflows: Integrate Delethink step (Algorithm 1) into existing verl/SGLang, vLLM, or Ray training stacks; introduce a “Markovian RL” flag in training configs.
    • Assumptions/Dependencies: Verifiable reward functions; appropriate chunk size C and carryover m; reference policy and KL regularization remain stable.
  • Sliding-window “Markovian inference mode” for existing LLMs (software, cloud AI)
    • Serve long chains of thought by chunking generation and carrying over only the last m tokens; exploit observed zero-shot Markovian traces in off-the-shelf models.
    • Tools/Workflows: Inference server middleware that resets context at chunk boundaries and preserves a small carryover; cost-based billing per chunk.
    • Assumptions/Dependencies: Tasks tolerate local, textual state handoff; small m preserves coherence for that domain.
  • Enterprise copilots with bounded-memory long reasoning (enterprise software)
    • Conduct extended analyses (e.g., compliance checks, multi-step report synthesis, strategy plans) with predictable compute and memory.
    • Tools/Workflows: “State budgeter” that enforces a carryover quota; chunked trace logging for auditability and post-mortems.
    • Assumptions/Dependencies: Users allow chain-of-thought storage when appropriate; summaries must capture critical requirements at each boundary.
  • Code generation and refactoring at scale (software engineering)
    • Break multi-file refactors and complex bug traces into chunks with explicit textual state; reduce KV-cache growth and costs in CI/CD agents.
    • Tools/Workflows: IDE plugin that inserts “state notes” at chunk ends; CI bots that checkpoint reasoning before tool calls.
    • Assumptions/Dependencies: High-fidelity state messages near boundaries; integration with repository context and tests.
  • Retrieval-augmented “state carryover” (RAG + reasoning)
    • Combine small textual carryover with retrieval of previously emitted state notes for longer-horizon tasks without expanding context.
    • Tools/Workflows: Lightweight ephemeral memory store indexed by task/thread; retrieval hooks only for past state notes, not full histories.
    • Assumptions/Dependencies: Retrieval quality; guardrails to avoid reintroducing quadratic context.
  • Cost-aware long-form analysis for finance and legal (finance, legal tech)
    • Run deep diligence or contract review in bounded chunks; produce auditable, sectioned rationales.
    • Tools/Workflows: “Markovian trace inspector” that highlights state transitions and checks for missing obligations across chunks.
    • Assumptions/Dependencies: Coverage risks if crucial early facts aren’t summarized into the carryover; validation workflows required.
  • Education and tutoring with stepwise scaffolding (education)
    • Tutors show solutions in chunks, teaching students to summarize state; lowers infra costs for long derivations.
    • Tools/Workflows: Classroom dashboards that visualize per-chunk reasoning and state notes.
    • Assumptions/Dependencies: Rewards/verifiers for correctness; pedagogical guardrails for chain-of-thought disclosure.
  • Energy- and cost-efficient cloud offerings (cloud/ML ops)
    • Offer a “Markovian Thinking API” tier with lower GPU-time/carbon per long-reasoning session.
    • Tools/Workflows: Usage meters per chunk instead of per token; autoscalers tuned for linear scaling.
    • Assumptions/Dependencies: Serving stack supports reset-and-carryover; observability of chunk boundaries.
  • Safer long-horizon interactions via state budgeting (AI safety/compliance)
    • Reduce propagation of prompt injections and sensitive data by limiting carryover; inspect each state handoff for policy violations.
    • Tools/Workflows: “State budget policy” that caps m and scans carryover text; chunk-level compliance checks.
    • Assumptions/Dependencies: Safety filters strong enough; some tasks may need retrieval of vetted state to avoid information loss.
  • On-device or low-memory deployments for extended planning (mobile/edge, robotics prototyping)
    • Execute long reasoning on constrained devices by chunking and re-prompting with small carryover.
    • Tools/Workflows: Mobile inference wrappers with sliding-window generation; robot planners that serialize objectives into compact textual state.
    • Assumptions/Dependencies: Latency budget accommodates chunk boundaries; careful choice of C and m for task complexity.

Long-Term Applications

These concepts require further research, scaling, or validation (e.g., safety and regulatory testing), but are enabled by the paper’s paradigm.

  • Million-token deliberation without quadratic overhead (software/AI research)
    • Push long-horizon reasoning (theorem proving, formal verification, program synthesis) far beyond current context limits.
    • Tools/Workflows: Curriculum schedules that teach robust state formation; synthetic environments with rewards for correct long-range continuation.
    • Assumptions/Dependencies: Stable learning of state messages; stronger verifiers/reward models for non-verifiable domains.
  • Healthcare decision support with bounded-state reasoning (healthcare)
    • Clinical case analysis across lengthy records using minimal carryover and rigorously validated state notes; privacy-friendly by default.
    • Tools/Workflows: EHR interfaces that render chunked rationales and require clinician sign-off per chunk; state validation against guidelines.
    • Assumptions/Dependencies: Regulatory clearance; clinical safety audits; guarantees that carryover captures all critical patient context.
  • Markovian Reasoning for robotics and autonomy (robotics)
    • Long-horizon task planning and execution where policies persist via compact textual or latent state across planning cycles.
    • Tools/Workflows: Hybrid planners that combine Delethink-style textual state with symbolic task graphs; recovery after resets/failures.
    • Assumptions/Dependencies: Robustness to partial observability; integration with perception/action loops; safety and verification.
  • Compute-governed and carbon-aware AI standards (policy)
    • Establish benchmarks and procurement standards favoring linear-compute reasoning; define “state carryover caps” to limit data retention.
    • Tools/Workflows: Compliance attestations reporting H100-months/GPU-hours saved per task; carbon budgets tied to chunked reasoning profiles.
    • Assumptions/Dependencies: Agreement on measurement protocols; third-party audits.
  • Architectures optimized for Markovian Thinking (ML systems)
    • Co-design with state-space models or linear attention to further cut inference/memory; recurrent “reasoning cores” trained on Delethink-like environments.
    • Tools/Workflows: Training recipes that align chunk boundaries with recurrent state updates; new benchmarks isolating state quality.
    • Assumptions/Dependencies: Stable optimization; equivalent or better accuracy than attention-dominant stacks.
  • Formal guarantees and verifiable state contracts (assurance)
    • Specify and verify properties of the carryover (e.g., “must include constraints X,Y”) to ensure no critical info is dropped between chunks.
    • Tools/Workflows: State contract checkers; proof-carrying state messages; conformance tests across chunk boundaries.
    • Assumptions/Dependencies: Domain-specific ontologies; hybrid symbolic-LLM methods.
  • Privacy-preserving, session-spanning assistants (daily life, enterprise)
    • Assistants that reason over weeks/months via small, signed state snapshots rather than storing raw histories.
    • Tools/Workflows: Encrypted, minimal state vaults; user controls to review/edit carried state between sessions.
    • Assumptions/Dependencies: UX for state inspection; strong privacy guarantees and revocation.
  • Multi-agent Markovian collaboration (agents)
    • Agents exchange compact state messages to coordinate large tasks (software projects, data pipelines) without shared massive contexts.
    • Tools/Workflows: Protocols for state message passing; arbitration/judging agents that validate handoffs.
    • Assumptions/Dependencies: Communication reliability; standards for state schemas.
  • Domain-specific “state compilers” (tooling)
    • Compilers that transform intermediate reasoning into minimal, loss-aware state tailored to domains (law, finance, science).
    • Tools/Workflows: Learned or rule-based reducers that emit carryover tailored to domain constraints.
    • Assumptions/Dependencies: Adequate domain coverage; ability to quantify information loss vs. downstream task impact.
  • Safety cases for chain-of-thought governance (policy/safety)
    • Define when and how chunked CoT can be logged, displayed, or redacted; set norms for teaching vs. concealed CoT.
    • Tools/Workflows: Policy templates that tie CoT visibility to user roles and risk tiers; chunk-level redaction tools.
    • Assumptions/Dependencies: Cross-stakeholder consensus; alignment with jurisdictional regulations.

Notes on feasibility across applications:

  • The Markovian approach assumes tasks admit a compact “textual state” that suffices for continuation; when long-range rare facts matter, retrieval or verified state augmentation may be necessary.
  • Empirical gains depend on well-chosen chunk size C and carryover m; too small m risks losing critical context; too large m erodes efficiency.
  • Safety-critical deployments require rigorous evaluation because chunking changes failure modes (e.g., omissions at boundaries).
  • Compute savings materialize most when attention cost dominates; synergy with linear/recurrent architectures may amplify benefits over time.

Glossary

  • Advantage estimator: A method to compute per-trajectory or per-token advantages for credit assignment in policy-gradient RL. "off-the-shelf advantage estimator"
  • Attention-based policies: Policies implemented with self-attention that read growing contexts, incurring quadratic costs with length. "For attention-based policies this entails prohibitive, quadratic growth of compute throughout."
  • Attention-sink tokens: Special tokens retained in streaming contexts to stabilize attention quality. "preserves a small set of attention-sink tokens to stabilize quality under sliding windows."
  • Autoregressively: Generating each token conditioned on previously generated tokens in sequence. "autoregressively: $\sim \pi_\theta(\cdot \mid )$"
  • Carryover: A brief portion of prior reasoning copied into the next chunk to maintain continuity. "a short carryover from the previous chunk."
  • Chain-of-thought (LongCoT): Extended intermediate reasoning before answering, represented by many “thinking tokens.” "long chains of thought (LongCoT)"
  • Chunked: Organizing generation into fixed-size segments to bound context. "chunked, markovian process"
  • Constant memory: Memory usage that does not grow with the total thinking length. "linear compute with constant memory"
  • Constant-size state: A bounded input state fed to the policy regardless of total reasoning length. "conditioning on a constant-size state"
  • Expected return: The RL objective of maximizing the expectation of rewards over trajectories. "maximize expected return"
  • FLOP: The count of floating-point operations used as a compute metric. "FLOP"
  • H100-months: A compute cost unit estimating wall-clock usage of NVIDIA H100 GPUs. "LongCoT-RL costs 27 H100-months vs. 7 for Delethink."
  • In-distribution: Behavior occurring within the distribution the model commonly exhibits or is trained on. "has strong support for Markovian Thinking in-distribution"
  • KL coefficient: The scalar weight on the KL regularization term in the RL objective. "$\beta$ is the KL coefficient"
  • KL penalty: Regularization via the Kullback–Leibler divergence to keep the policy close to a reference. "a KL penalty against a reference model"
  • Linear attention: Attention variants whose time/memory scale linearly with sequence length. "kernel-based linear attention"
  • Linear compute: Total computation that grows linearly with thinking length. "yields linear compute with constant memory"
  • LongCoT environment: An RL thinking setup where the context grows by concatenating all prior reasoning tokens. "the LongCoT environment concatenates tokens indefinitely"
  • Markov property: The property that the future depends only on the current state, not the full history. "satisfies the Markov property"
  • Markovian Decision Process (MDP): The formal RL framework defining states, actions, and transitions for language generation. "language-generation Markovian Decision Process (MDP)"
  • Markovian Thinker: A policy trained to maintain a textual state and reason across chunks with bounded context. "We call such a policy a Markovian Thinker."
  • Markovian Thinking: A paradigm where reasoning advances while conditioning on a bounded state. "We propose Markovian Thinking, a paradigm"
  • Policy-gradient updates: RL optimization that updates the policy using gradients of expected returns. "applying policy-gradient updates"
  • Reference model: The fixed model used as the target in KL regularization. "against a reference model"
  • Reference policy: The baseline distribution the trained policy is regularized toward via KL. "$\pi_\text{ref}$ is the reference policy"
  • Reinforcement Learning from Verifiable Rewards (RLVR): An RL formulation where rewards are derived from automatically checkable signals. "We adopt the RLVR formulation"
  • Self-attention: The transformer mechanism computing token interactions via attention over the sequence. "replaces self-attention with state-space models"
  • Sequential sampling: An evaluation procedure reporting performance when outputs are sampled sequentially. "reported using sequential sampling"
  • Sliding windows: A moving context window used to bound attention during long generation. "under sliding windows"
  • State-space models: Sequence models with recurrent state updates enabling linear-time, constant-memory generation. "state-space models"
  • Test-time compute: The amount of computation used during inference rather than training. "test-time compute (reported using sequential sampling)"
  • Test-time scaling: Increasing inference-time thinking or compute budget to boost accuracy. "With test-time scaling, Delethink continues to improve"
  • Token-by-token generation MDP: An MDP where each action is the next token and the state is the prompt plus generated tokens so far. "token-by-token generation MDP (as used in standard LongCoT RLVR)"
  • Trajectory-level reward: A reward assigned to the entire generated sequence rather than individual steps. "trajectory-level reward function (verifiable rewards in our case)"
  • Verifiable rewards: Rewards computed by checking outputs with automatic or programmatic criteria. "verifiable rewards in our case"
  • Zero-shot: Achieving behavior without additional training or specialized prompts. "samples Markovian traces zero-shot"
