
Deep Self-Evolving Reasoning (DSER)

Updated 22 October 2025
  • DSER is a probabilistic, iterative framework where models enhance answers via self-generated verification, modeled as a two-state Markov process.
  • It employs parallel long-horizon rollouts to mitigate weak verification in compact open-weight models, significantly boosting problem-solving accuracy.
  • Empirical results show that DSER enables smaller models to tackle challenging benchmarks and even outperform larger teacher models through majority-vote aggregation.

Deep Self-Evolving Reasoning (DSER) refers to a probabilistic and algorithmic paradigm in which a reasoning system—most often an LLM or deep network—iteratively improves its solutions by self-verification and refinement. Unlike static reasoning or fixed chain-of-thought approaches, DSER operates over long horizons, treating each reasoning/refinement step as a stochastic process, typically formalized as a Markov chain in the solution space. This paradigm enables convergence towards correct answers even for highly complex tasks, provided the probability of iterative improvement exceeds that of degradation. DSER has been shown to substantially enhance the capabilities of compact, open-weight models, extending their problem-solving range well beyond their single-turn accuracy and providing both a practical tool and a diagnostic lens for limitations in current self-verification and refinement processes (Liu et al., 20 Oct 2025).

1. Probabilistic Foundation and Markov Chain Formalism

DSER conceptualizes the reasoning process as a discrete-time Markov chain, where each step is a potentially stochastic transition between two solution states: "Correct" (C) and "Incorrect" (I). The fundamental assumption is that, after each reasoning attempt or refinement, the new state depends only on the present state (not the full history), consistent with Markovian dynamics. Specifically, for a reasoning problem q, let s^{(0)} denote the initial candidate solution, typically produced by an LLM:

s^{(0)} = \mathcal{R}^{(\text{LLM})}(q)

At each iteration n, this candidate is subject to verification (e.g., via an LLM-generated critique under a verification prompt p_v) and then refinement (under a refinement prompt p_r), yielding s^{(n+1)}:

v^{(n)} = \mathcal{R}^{(\text{LLM})}([q; s^{(n)}; p_v])

s^{(n+1)} = \mathcal{R}^{(\text{LLM})}([q; s^{(n)}; p_v; v^{(n)}; p_r])

Transitions between "Correct" and "Incorrect" form a two-state Markov process with transition probabilities p_{\text{IC}} (I \to C) and p_{\text{CI}} (C \to I), yielding the transition matrix:

P = \begin{bmatrix} 1 - p_{\text{CI}} & p_{\text{CI}} \\ p_{\text{IC}} & 1 - p_{\text{IC}} \end{bmatrix}

The stationary probability of being in the correct state is

\pi_C = \frac{p_{\text{IC}}}{p_{\text{IC}} + p_{\text{CI}}}

If p_{\text{IC}} > p_{\text{CI}}, repeated long-horizon reasoning trajectories will converge to predominantly correct solutions.
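The two-state dynamics are easy to check numerically. The following minimal Python sketch (an illustration, not code from the paper) simulates the Correct/Incorrect chain and compares the empirical fraction of time spent in the Correct state against the stationary value \pi_C:

```python
import random

def simulate_chain(p_ic, p_ci, n_steps, seed=0):
    """Simulate the two-state (Correct/Incorrect) DSER Markov chain and
    return the fraction of steps spent in the Correct state."""
    rng = random.Random(seed)
    correct = False  # start from an Incorrect initial candidate s^(0)
    time_correct = 0
    for _ in range(n_steps):
        if correct:
            correct = rng.random() >= p_ci  # degrade with probability p_CI
        else:
            correct = rng.random() < p_ic   # improve with probability p_IC
        time_correct += correct
    return time_correct / n_steps

# With p_IC = 0.3 > p_CI = 0.1, the stationary share is 0.3 / 0.4 = 0.75.
empirical = simulate_chain(p_ic=0.3, p_ci=0.1, n_steps=100_000)
print(f"empirical: {empirical:.3f}, stationary: {0.3 / 0.4:.3f}")
```

Even starting from an incorrect solution, the long-run share of correct states approaches \pi_C whenever improvement outpaces degradation.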

2. Iterative Self-Verification and Refinement

A critical distinction in DSER is the explicit modeling of the verification–refinement cycle. Solutions are not accepted at face value; instead, after each generation, an autonomous or model-internal verification phase is triggered (e.g., via a specialized prompt or self-critique module), and the observed deficiencies or errors inform the subsequent refinement step. Even if each individual step is only mildly better than random (i.e., p_{\text{IC}} \gtrsim p_{\text{CI}}), the process, through repeated iteration or parallelism, "stochastically wanders towards correctness." There is no requirement that each verification or refinement be perfect or reliable; it suffices that, in aggregate, improvement occurs with non-zero probability.

This self-evolving dynamic is critical in real-world scenarios where the verification and correction modules in open-weight, smaller-scale LLMs are known to be much weaker than in proprietary or massive models. DSER leverages long-horizon and parallel rollouts to amplify marginally positive tendencies, effectively overcoming high error rates and limited single-step verification capacities.
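The generate–verify–refine cycle can be sketched as a simple loop. In the sketch below, `generate`, `verify`, and `refine` are hypothetical callables standing in for the LLM invocations \mathcal{R}^{(\text{LLM})} in the formalism; the toy instantiation only illustrates that a weakly reliable refiner (one that improves the solution just 80% of the time) still converges over a long horizon:

```python
import random

def dser_rollout(question, generate, verify, refine, n_iters=16):
    """One long-horizon DSER trajectory: generate an initial solution,
    then repeatedly verify and refine it."""
    solution = generate(question)                        # s^(0)
    for _ in range(n_iters):
        critique = verify(question, solution)            # v^(n)
        solution = refine(question, solution, critique)  # s^(n+1)
    return solution

# Toy instantiation: solutions are integers, the true answer is 42, and
# the refiner moves toward it only 80% of the time (weak but > random).
rng = random.Random(1)

def toy_refine(question, solution, critique):
    if critique == 0 or rng.random() >= 0.8:
        return solution  # already correct, or this refinement step failed
    return solution + (1 if critique > 0 else -1)

answer = dser_rollout("toy", lambda q: 0,
                      lambda q, s: 42 - s,  # critique = signed error
                      toy_refine, n_iters=200)
print(answer)
```

Swapping the toy callables for real LLM calls (with verification and refinement prompts) yields one DSER trajectory; parallel rollouts simply run this loop many times independently.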

3. Empirical Results and Test-Time Scaling

The DSER paradigm has been empirically validated on highly challenging mathematical benchmarks such as AIME 2024–2025 (Liu et al., 20 Oct 2025). When applied to the DeepSeek-R1-0528-Qwen3-8B model—a compact, open-weight 8B-parameter LLM—DSER enabled the solution of 5/9 previously unsolvable Olympiad-level problems. By executing many parallel long-horizon DSER rollouts (as many as 80 per hard instance), the framework achieved majority-voting correctness in aggregate, boosting Pass@1 by 6.5–9.0% over the baseline. Notably, when output aggregation via majority vote was employed, the DSER-augmented model exceeded the single-turn accuracy of its original 600B-parameter teacher model.

Evaluation utilized both pointwise and aggregated metrics:

  • Avg@K (Average Accuracy): Mean Pass@1 accuracy over K independent runs.
  • Cons@K (Consistency Accuracy): Majority-vote correctness over K parallel DSER trajectories.
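Both metrics can be computed directly from graded final answers. A small sketch (the metric names follow the paper; exact string-match grading is an assumption for illustration):

```python
from collections import Counter

def avg_at_k(answers, reference):
    """Avg@K: mean Pass@1 accuracy over K independent runs."""
    return sum(a == reference for a in answers) / len(answers)

def cons_at_k(answers, reference):
    """Cons@K: correctness of the majority-vote answer across K runs."""
    majority, _ = Counter(answers).most_common(1)[0]
    return float(majority == reference)

# K = 5 trajectories: only 3 of 5 are right, yet the majority vote is.
runs = ["42", "42", "17", "42", "17"]
print(avg_at_k(runs, "42"), cons_at_k(runs, "42"))  # 0.6 1.0
```

The gap between the two metrics is exactly where aggregation pays off: Cons@K can be 1.0 even when Avg@K is well below it, which is how the 8B model's majority vote surpassed its teacher's single-turn accuracy.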

Successful convergence was observed even when individual verification steps were notably noisy, provided the net improvement probability remained positive.

4. Diagnostic and Theoretical Implications

DSER not only extends the practical reasoning capabilities of LLMs but also serves as a diagnostic lens for their intrinsic limitations. Because the stationary distribution \pi_C and the convergence speed, governed by the second eigenvalue |\lambda_2| = 1 - p_{\text{IC}} - p_{\text{CI}}, are tightly linked to the quality of internal verification and correction, DSER exposes situations where weak verification (high p_{\text{CI}}) or unstable refinement impedes performance, even with many rollouts. This provides a pathway to both:

  • Benchmark and compare models in terms of their intrinsic self-evolving tendencies.
  • Motivate research on architectures or training schemes that improve the reliability and stability of internal self-critique modules.
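In this diagnostic spirit, the transition probabilities can be estimated empirically from graded refinement traces. The sketch below (an illustration, not the paper's code) recovers p_{\text{IC}}, p_{\text{CI}}, the stationary share \pi_C, and the mixing factor |\lambda_2| from per-step correctness labels:

```python
def estimate_transitions(trajectories):
    """Estimate p_IC and p_CI from correctness traces of DSER runs.
    Each trajectory is a list of booleans (True = Correct), one entry
    per refinement step, graded against ground truth."""
    ic = i_total = ci = c_total = 0
    for traj in trajectories:
        for prev, nxt in zip(traj, traj[1:]):
            if prev:
                c_total += 1
                ci += not nxt   # Correct -> Incorrect degradation
            else:
                i_total += 1
                ic += nxt       # Incorrect -> Correct improvement
    p_ic = ic / i_total if i_total else 0.0
    p_ci = ci / c_total if c_total else 0.0
    return p_ic, p_ci

traces = [[False, False, True, True], [False, True, True, True]]
p_ic, p_ci = estimate_transitions(traces)
pi_c = p_ic / (p_ic + p_ci)       # stationary probability of Correct
mixing = abs(1 - p_ic - p_ci)     # |lambda_2|: near 1 means slow convergence
print(p_ic, p_ci, pi_c, mixing)
```

Comparing models by their estimated (p_{\text{IC}}, p_{\text{CI}}) pairs makes their self-evolving tendencies directly measurable, rather than inferring them from end-task accuracy alone.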

A key implication is that test-time computation and iterative refinement, central to DSER, can partially compensate for limited model scale, allowing compact models to match or surpass larger models in final solution quality when combined with robust self-evolving loops.

5. Limitations and Research Directions

DSER, by construction, assumes a nonzero net improvement probability: if p_{\text{IC}} \leq p_{\text{CI}}, the process cannot reliably converge to correct answers, irrespective of rollout count. In practice, this requires at least a "weakly self-corrective" base model. The effectiveness and efficiency of DSER are also bottlenecked by the current limits of open-weight models' verification and correction capabilities; progress depends on future advances in these components.

Open questions identified include:

  • How best to incentivize and train next-generation models for strong, stable self-verification and correction cycles.
  • How to integrate DSER-inspired iterative reasoning into end-to-end reinforcement learning to reinforce high-quality, self-corrected trajectories.
  • How many iterations or parallel rollouts are required for convergence in various task domains, and how this scales with model quality.

DSER thus serves both as a toolkit for immediate test-time performance gains and as a foundational concept for developing future models imbued with intrinsic, deep self-evolving reasoning capabilities.
